SOME SHARP PERFORMANCE BOUNDS FOR LEAST
SQUARES REGRESSION WITH $L_1$ REGULARIZATION

By Tong Zhang 1 

Rutgers University 

We derive sharp performance bounds for least squares regression
with $L_1$ regularization from parameter estimation accuracy and feature
selection quality perspectives. The main result proved for $L_1$
regularization extends a similar result in [Ann. Statist. 35 (2007)
2313-2351] for the Dantzig selector. It gives an affirmative answer
to an open question in [Ann. Statist. 35 (2007) 2358-2364]. Moreover,
the result leads to an extended view of feature selection that
allows less restrictive conditions than some recent work. Based on
the theoretical insights, a novel two-stage $L_1$-regularization procedure
with selective penalization is analyzed. It is shown that if the
target parameter vector can be decomposed as the sum of a sparse
parameter vector with large coefficients and another less sparse vector
with relatively small coefficients, then the two-stage procedure
can lead to improved performance.

1. Introduction. Consider a set of input vectors $x_1, \dots, x_n \in R^d$ with
corresponding desired output variables $y_1, \dots, y_n$. We use $d$ instead of the
more conventional $p$ to denote data dimensionality, because the symbol $p$
is used for another purpose. The task of supervised learning is to estimate
the functional relationship $y \approx f(x)$ between the input $x$ and the output
variable $y$ from the training examples $\{(x_1, y_1), \dots, (x_n, y_n)\}$.

In this paper, we consider the linear prediction model $f(x) = \beta^T x$ and
focus on least squares for simplicity. A commonly used estimation method
is $L_1$-regularized empirical risk minimization (also known as Lasso)

(1) $$\hat\beta = \arg\min_{\beta \in R^d} \left[ \frac{1}{n} \sum_{i=1}^n (\beta^T x_i - y_i)^2 + \lambda \|\beta\|_1 \right],$$

where $\lambda \ge 0$ is an appropriate regularization parameter.
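
As an illustration of the estimator (1), the following is a minimal sketch of one standard way to compute a Lasso solution, namely cyclic coordinate descent with soft thresholding. The choice of solver, the names `soft_threshold` and `lasso_cd`, the number of sweeps and the synthetic data are all assumptions made for this example; none of them is taken from the paper.

```python
import numpy as np

def soft_threshold(z, tau):
    """Soft-thresholding operator: sign(z) * max(|z| - tau, 0)."""
    return np.sign(z) * max(abs(z) - tau, 0.0)

def lasso_cd(X, y, lam, n_sweeps=500):
    """Minimize (1/n) * ||X b - y||_2^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, d = X.shape
    beta = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0) / n        # (1/n) * ||X_j||^2 for each column j
    resid = y - X @ beta                     # current residual y - X beta
    for _ in range(n_sweeps):
        for j in range(d):
            if col_sq[j] == 0.0:
                continue                     # an all-zero column never enters the fit
            resid += X[:, j] * beta[j]       # remove coordinate j from the fit
            rho = X[:, j] @ resid / n        # (1/n) * X_j^T (partial residual)
            # One-dimensional minimizer of col_sq*b^2 - 2*rho*b + lam*|b|
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
            resid -= X[:, j] * beta[j]       # put the updated coordinate back
    return beta

# Hypothetical synthetic data: a sparse target with two large coefficients.
rng = np.random.default_rng(0)
n, d = 50, 8
X = rng.standard_normal((n, d))
beta_true = np.array([2.0, -1.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ beta_true + 0.1 * rng.standard_normal(n)
beta_hat = lasso_cd(X, y, lam=0.2)
print(np.round(beta_hat, 3))
```

The threshold `lam / 2` arises because the squared loss in (1) carries no factor 1/2; with a different normalization of the loss the threshold would change accordingly.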

We are specifically interested in two related themes: parameter estima- 
tion accuracy and feature selection quality. A general convergence theorem 
is established in Section 4, which has two consequences. First, the theorem 
implies a parameter estimation accuracy result for the standard Lasso that
extends the main result for the Dantzig selector in [4]. A detailed comparison is
given in Section 6. This result provides an affirmative answer to an open 
question in [10] concerning whether a bound similar to that of [4] holds for 
Lasso. Second, we show, in Section 7, that the general theorem in Section 
4 can be used to study the feature selection quality of Lasso. In this con- 
text, we consider an extended view of feature selection by selecting features 
with estimated coefficients larger than a certain nonzero threshold. This 
method is different from [20], which only considered zero threshold. An in- 
teresting consequence of our method is the consistency of feature selection, 
even when the irrepresentable condition of [20] (the condition is necessary in 
their approach) is violated. Moreover, the combination of our parameter
estimation and feature selection results suggests that the standard Lasso might
be sub-optimal when the target can be decomposed as a sparse parameter
vector with large coefficients, plus another less sparse vector with small coef- 
ficients. A two-stage selective penalization procedure is proposed in Section 
8 to remedy the problem. We obtain a parameter estimation accuracy result 
for this procedure that can improve the corresponding result of the standard 
(one-stage) Lasso under appropriate conditions. 

For simplicity, most results (except for Theorem 4.1) in this paper are
stated under the fixed design situation (i.e., the $x_i$ are fixed, while the
$y_i$ are random). However, with small modifications, they can also be applied
to random design.

2. Related work. In the literature, there are typically three types of
results for learning a sparse approximate target vector
$\bar\beta = [\bar\beta_1, \dots, \bar\beta_d] \in R^d$ such that
$E(y|x) \approx \bar\beta^T x$. These results are as follows:

1. Feature selection accuracy. Identify nonzero coefficients (e.g., [5, 18, 19,
20]), or more generally, identify features with target coefficients larger
than a certain threshold (see Section 7). That is, we are interested in
identifying the relevant feature set $\{j : |\bar\beta_j| > \alpha\}$ for some
threshold $\alpha \ge 0$.

2. Parameter estimation accuracy. How accurate is the estimated parameter,
compared to the approximate target $\bar\beta$, measured in a certain norm (e.g.,
[1, 2, 4, 5, 9, 13, 17, 19])? That is, let $\hat\beta$ be the estimated parameter; we
are interested in developing a bound for $\|\hat\beta - \bar\beta\|_p$ for some $p$.
Theorems 4.1 and 8.1 give such results.

3. Prediction accuracy. The prediction performance of the estimated
parameter, both in fixed and random design settings (e.g., [2, 3, 5, 11,
13, 14]). For example, in fixed design, we are interested in a bound for
$\frac{1}{n}\sum_{i=1}^n (\hat\beta^T x_i - E y_i)^2$ or the related quantity
$\frac{1}{n}\sum_{i=1}^n ((\hat\beta - \bar\beta)^T x_i)^2$.

In general, good feature selection implies good parameter estimation, and 
good parameter estimation implies good prediction accuracy. However, the 
reverse directions do not usually hold. In this paper, we focus on the first two 
aspects, feature selection and parameter estimation, as well as their
interrelationship. Due to space limitations, the prediction accuracy of Lasso
is not considered in this paper. However, it is a relatively straightforward
consequence of parameter estimation bounds with $p = 1$ and $p = 2$.

As mentioned in the Introduction, one motivation of this paper is to
develop a parameter estimation bound for $L_1$ regularization directly
comparable to that of the Dantzig selector in [4]. Compared to [4], where a
parameter estimation error bound in 2-norm is proved for the Dantzig
selector, Theorem 4.1 in Section 4 presents a more general bound for p-norm
where $p \in [1, \infty]$. We are particularly interested in a bound in
$\infty$-norm because such a bound immediately induces a result on the feature
selection accuracy of Lasso. This point of view is taken in Section 7, where
feature selection is considered. Achieving good feature selection is important,
because it can be used to improve the standard one-stage Lasso. In Section 8, we
develop this observation and show that a two-stage method with good feature
selection achieves a bound better than that of one-stage Lasso. Experiments
in Section 9 confirm this theoretical observation.

Since the development of this paper relies heavily on the parameter estimation
accuracy of Lasso in p-norm, it differs from and complements the earlier work
on Lasso cited above. Among earlier work, prediction accuracy bounds for
Lasso were derived in [2, 3] and [5] under mutual incoherence conditions
(introduced in [9]) that are generally regarded as stronger than the sparse
eigenvalue conditions employed by [4]. This is because it is easier for a
random matrix to satisfy sparse eigenvalue conditions than mutual incoherence
conditions. A more detailed discussion on this point is given at the end of
Section 4. Moreover, the relationships among the different quantities are
presented in Section 3. As we shall see from Section 3, mutual incoherence
conditions are also stronger than the conditions required for deriving p-norm
estimation error bounds in this paper. Therefore, under appropriate mutual
incoherence conditions, our analysis leads to sharp p-norm parameter estimation
bounds for all $p \in [1, \infty]$ in Corollary 4.1. In comparison, sharp p-norm
parameter estimation bounds cannot be directly derived from the prediction error
bounds studied in some earlier work.

We shall also point out that some 1-norm parameter estimation bounds
were established in [2], but not for $p > 1$. At the same time this paper was
written, related parameter estimation error bounds were also obtained in
[1], under appropriate sparse eigenvalue conditions, both for the Dantzig
selector and for Lasso. However, the results are only for $p \in [0, 2]$ and only
for truly sparse targets [i.e., $E(y|x) = \bar\beta^T x$ for some sparse $\bar\beta$].
In particular, their parameter estimation bound with $p = 2$ does not reproduce
the main result of [4], while our bounds in this paper (which allow approximately
sparse targets) do. We shall point out that, in addition to parameter estimation
error bounds, a prediction error bound that allows approximately sparse
targets was also obtained in [1], in a form quite similar to the parameter
estimation bounds of Lasso in this paper and that of the Dantzig selector in
[4]. However, that result does not immediately imply a bound on p-norm
parameter estimation error. A similar bound was derived in [5] for Lasso, but
it is not as elaborate as that in [4]. Another related work is [19], which contains
a 2-norm parameter estimation error bound for Lasso but has a cruder form
than ours. In particular, their result is worse than the form given in [4],
as well as the first claim of Theorem 4.1 in this paper. A similar 2-norm
parameter estimation error bound, but only for truly sparse targets, can be
found in [17]. In [9], the authors derived a 2-norm estimation bound for Lasso
with approximately sparse targets under mutual incoherence conditions but
without stochastic noise. We shall note that the "noise" in their paper is not
random and corresponds to approximation error in our notation, as discussed
in Section 5. Their result is thus weaker than our result in Corollary 5.1.
In addition to the above work, prediction error bounds were also obtained
in [13, 14] and [11] for general loss functions and random design. However,
such results cannot be used to derive the p-norm parameter estimation bounds
that we consider here.

3. Conditions on design matrix. In order to obtain good bounds, it is
necessary to impose conditions on the design matrix that generally specify
that small diagonal blocks of the design matrix are nonsingular. Examples
include mutual incoherence conditions [9] and sparse eigenvalue conditions [19].
The sparse eigenvalue condition is also known as RIP (restricted isometry
property) in the compressed sensing literature, where it was first introduced
in [7].

Mutual incoherence conditions are usually more restrictive than sparse 
eigenvalue conditions. That is, a matrix that satisfies an appropriate mutual 
incoherence condition will also satisfy the necessary sparse eigenvalue con- 
dition in our analysis, but the reverse direction does not hold. For example, 
we will see from the discussion at the end of Section 4 that, for random 
design matrices, more samples are needed in order to satisfy the mutual in- 
coherence condition than to satisfy the sparse eigenvalue condition. In our 
analysis, the weaker sparse eigenvalue condition can be used to obtain sharp 
bounds for 2-norm parameter estimation error. Since we are interested in 
general p-norm parameter estimation error bounds, other conditions on the 
design matrix will be considered. They can be regarded as generalizations of 
the sparse eigenvalue condition, or RIP. All of these conditions are weaker
than the mutual incoherence condition.

We introduce the following definitions that specify properties of
sub-matrices of a large matrix $A$. These quantities (when used with the design
matrix $A = \frac{1}{n}\sum_{i=1}^n x_i x_i^T$) will appear in our results.

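Definition 3.1. The p-norm of a vector $\beta = [\beta_1, \dots, \beta_d] \in R^d$ is
defined as $\|\beta\|_p = (\sum_{j=1}^d |\beta_j|^p)^{1/p}$. Given a positive
semi-definite matrix $A \in R^{d \times d}$, and given $\ell, k \ge 1$ such that
$\ell + k \le d$, let $I, J$ be disjoint subsets of $\{1, \dots, d\}$ with $k$ and
$\ell$ elements, respectively. Also, let $A_{I,I} \in R^{k \times k}$ be the
restriction of $A$ to indices $I$, and $A_{I,J} \in R^{k \times \ell}$ be the
restriction of $A$ to indices $I$ on the left and $J$ on the right. Define, for
$p \in [1, \infty]$,

$$\rho^{(p)}_{A,k} = \sup_{v \in R^k, I} \frac{\|A_{I,I} v\|_p}{\|v\|_p}, \qquad \theta^{(p)}_{A,k,\ell} = \sup_{u \in R^\ell, I, J} \frac{\|A_{I,J} u\|_p}{\|u\|_\infty},$$

$$\mu^{(p)}_{A,k} = \inf_{v \in R^k, I} \frac{\|A_{I,I} v\|_p}{\|v\|_p}, \qquad \gamma^{(p)}_{A,k,\ell} = \sup_{u \in R^\ell, I, J} \frac{\|A_{I,I}^{-1} A_{I,J} u\|_p}{\|u\|_\infty}.$$

Moreover, for all $v = [v_1, \dots, v_k] \in R^k$, define
$v^{p-1} = [|v_1|^{p-1} \mathrm{sgn}(v_1), \dots, |v_k|^{p-1} \mathrm{sgn}(v_k)]$, and

$$\omega^{(p)}_{A,k} = \inf_{v \in R^k, I} \frac{\max(0, v^T A_{I,I} v^{p-1})}{\|v\|_p^p},$$

$$\pi^{(p)}_{A,k,\ell} = \sup_{v \in R^k, u \in R^\ell, I, J} \frac{(v^{p-1})^T A_{I,J} u \, \|v\|_p}{\max(0, v^T A_{I,I} v^{p-1}) \, \|u\|_\infty}.$$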

The ratio $\rho^{(p)}_{A,k} / \mu^{(p)}_{A,k}$ measures the closeness to the identity
matrix of $k \times k$ diagonal sub-matrices of $A$. The RIP concept in [7] can be
regarded as $\rho^{(2)}_{A,k} / \mu^{(2)}_{A,k}$ in our notation. The quantities
$\theta^{(p)}_{A,k,\ell}$, $\gamma^{(p)}_{A,k,\ell}$ and $\pi^{(p)}_{A,k,\ell}$ measure
the closeness to zero of the $k \times \ell$ off-diagonal blocks of $A$. Note that
$\mu^{(2)}_{A,k} = \omega^{(2)}_{A,k}$ and $\rho^{(2)}_{A,k}$ are the smallest and
largest eigenvalues of $k \times k$ diagonal blocks of $A$. It is easy to see that
the inequalities $\mu^{(p)}_{A,k} \le \rho^{(p)}_{A,k}$ hold. Moreover, we
can also obtain bounds on $\theta^{(p)}_{A,k,\ell}$, $\gamma^{(p)}_{A,k,\ell}$ and
$\pi^{(p)}_{A,k,\ell}$ using eigenvalues of sub-matrices of $A$. The bounds
essentially say that if diagonal sub-matrices of $A$ of size $k + \ell$ are
well-conditioned, then the quantities $\theta^{(2)}_{A,k,\ell}$,
$\gamma^{(2)}_{A,k,\ell}$ and $\pi^{(2)}_{A,k,\ell}$ are $O(\sqrt{\ell})$.
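
To make the case $p = 2$ concrete, the sketch below computes $\mu^{(2)}_{A,k}$ and $\rho^{(2)}_{A,k}$ by brute force, as the smallest and largest eigenvalues over all $k \times k$ diagonal blocks. The function name `sparse_eigenvalues` and the small example matrix are assumptions made for illustration. The enumeration visits every index set of size $k$, so it is only feasible for small $d$.

```python
import itertools
import numpy as np

def sparse_eigenvalues(A, k):
    """Return (mu, rho): the smallest and largest eigenvalues over all
    k x k principal sub-matrices A[I, I] of a symmetric matrix A (p = 2)."""
    d = A.shape[0]
    mu, rho = np.inf, -np.inf
    for I in itertools.combinations(range(d), k):    # all C(d, k) index sets
        eig = np.linalg.eigvalsh(A[np.ix_(I, I)])    # eigenvalues in ascending order
        mu = min(mu, eig[0])
        rho = max(rho, eig[-1])
    return mu, rho

# Hypothetical 3 x 3 positive definite matrix with unit diagonal.
A = np.array([[1.0, 0.5, 0.0],
              [0.5, 1.0, 0.2],
              [0.0, 0.2, 1.0]])
mu2, rho2 = sparse_eigenvalues(A, 2)
print(mu2, rho2, rho2 / mu2)   # the ratio plays the role of the RIP-type quantity
```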



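Proposition 3.1. The following inequalities hold:

$$\theta^{(2)}_{A,k,\ell} \le \ell^{1/2} \sqrt{(\rho^{(2)}_{A,k} - \mu^{(2)}_{A,\ell+k})(\rho^{(2)}_{A,\ell} - \mu^{(2)}_{A,\ell+k})},$$

$$\theta^{(p)}_{A,k,\ell} \le k^{\max(0, 1/p - 0.5)} \theta^{(2)}_{A,k,\ell},$$

$$\pi^{(2)}_{A,k,\ell} \le \frac{\ell^{1/2}}{2} \sqrt{\rho^{(2)}_{A,\ell} / \mu^{(2)}_{A,k+\ell} - 1}, \qquad \pi^{(p)}_{A,k,\ell} \le \theta^{(p)}_{A,k,\ell} / \omega^{(p)}_{A,k},$$

$$\gamma^{(p)}_{A,k,\ell} \le k^{\max(0, 1/p - 0.5)} \gamma^{(2)}_{A,k,\ell}, \qquad \gamma^{(p)}_{A,k,\ell} \le \theta^{(p)}_{A,k,\ell} / \mu^{(p)}_{A,k},$$

$$\min_i A_{i,i} - \sup_I \|A_{I,I} - \mathrm{diag}(A_{I,I})\|_p \le \omega^{(p)}_{A,k} \le \mu^{(p)}_{A,k},$$

where, for a matrix $B$, $\mathrm{diag}(B)$ is the diagonal of $B$, and
$\|B\|_p = \sup_u (\|Bu\|_p / \|u\|_p)$.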

The last inequality in the above proposition shows that $\mu^{(p)}_{A,k} > 0$ and
$\omega^{(p)}_{A,k} > 0$ when $A$ has a certain diagonal dominance (in p-norm)
property for its $k \times k$ blocks.

Finally, we state a result that bounds all quantities defined here using
the mutual incoherence concept of [9]. This result shows that mutual
incoherence is a stronger notion than all quantities we employ in this paper.
Although they are more complicated, by using these less restrictive quantities in
Theorem 4.1, we obtain stronger results than by using the mutual incoherence
condition (Corollary 4.1). For simplicity, we consider diagonally normalized
$A$ such that $A_{i,i} = 1$ for all $i$.

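Proposition 3.2. Given a matrix $A \in R^{d \times d}$ and assuming that $A_{i,i} = 1$
for all $i$, define the mutual coherence coefficient as
$M_A = \sup_{i \ne j} |A_{i,j}|$. Then the following bounds hold:

• $\rho^{(p)}_{A,k} \le 1 + M_A k$;

• $\mu^{(p)}_{A,k} \ge \omega^{(p)}_{A,k} \ge 1 - M_A k$;

• $\theta^{(p)}_{A,k,\ell} \le M_A k^{1/p} \ell$;

• $\pi^{(p)}_{A,k,\ell} \le M_A k^{1/p} \ell / \max(0, 1 - M_A k)$;

• $\gamma^{(p)}_{A,k,\ell} \le M_A k^{1/p} \ell / \max(0, 1 - M_A k)$.

4. A general performance bound for $L_1$ regularization. For simplicity,
we assume sub-Gaussian noise as follows. We use $x_{i,j}$ to indicate the jth
component of vector $x_i \in R^d$.

Assumption 4.1. Assume that, conditioned on $\{x_i\}_{i=1,\dots,n}$,
$\{y_i\}_{i=1,\dots,n}$ are independent (but not necessarily identically
distributed) sub-Gaussians. There exists a constant $\sigma \ge 0$ such that
$\forall i$ and $\forall t \in R$,

$$E_{y_i} \left[ e^{t(y_i - E y_i)} \mid \{x_i\}_{i=1,\dots,n} \right] \le e^{\sigma^2 t^2 / 2}.$$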
Both Gaussian and bounded random variables are sub-Gaussian using the
above definition. For example, if a random variable $\xi \in [a, b]$, then
$E_\xi e^{t(\xi - E\xi)} \le e^{(b-a)^2 t^2 / 8}$. If a random variable is Gaussian,
$\xi \sim N(0, \sigma^2)$, then $E_\xi e^{t\xi} \le e^{\sigma^2 t^2 / 2}$.
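
The bounded case can be checked numerically. The sketch below evaluates the exact centered moment generating function of a two-point random variable on $\{a, b\}$ and compares it with the bound $e^{(b-a)^2 t^2/8}$. The two-point distribution, the grid of values and the function names are assumptions chosen for this illustration.

```python
import math

def centered_mgf_two_point(a, b, q, t):
    """E exp(t * (xi - E xi)) for xi = b with probability q and xi = a otherwise."""
    mean = (1.0 - q) * a + q * b
    return (1.0 - q) * math.exp(t * (a - mean)) + q * math.exp(t * (b - mean))

def bounded_variable_bound(a, b, t):
    """The sub-Gaussian bound exp((b - a)^2 * t^2 / 8) for a variable in [a, b]."""
    return math.exp((b - a) ** 2 * t ** 2 / 8.0)

a, b = -1.0, 2.0
worst_ratio = 0.0
for q in (0.1, 0.3, 0.5, 0.9):
    for step in range(-50, 51):
        t = step / 10.0
        ratio = centered_mgf_two_point(a, b, q, t) / bounded_variable_bound(a, b, t)
        worst_ratio = max(worst_ratio, ratio)
print(worst_ratio)   # stays at or below 1 (up to rounding) on this grid
```

In the notation of Assumption 4.1, such a variable is sub-Gaussian with $\sigma = (b - a)/2$.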

For convenience, we also introduce the following definition. 

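Definition 4.1. Let $\beta = [\beta_1, \dots, \beta_d] \in R^d$ and $\alpha \ge 0$. We
define the set of relevant features with threshold $\alpha$ as

$$\mathrm{supp}_\alpha(\beta) = \{j : |\beta_j| > \alpha\}.$$

Moreover, if $|\beta_{(1)}| \ge \dots \ge |\beta_{(d)}|$ are in descending order, then
define

$$r_k^{(p)}(\beta) = \left( \sum_{j=k+1}^d |\beta_{(j)}|^p \right)^{1/p}$$

as the p-norm of the $d - k$ smallest components (in absolute value) of $\beta$.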
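
The two quantities in Definition 4.1 are straightforward to compute. The following sketch is a direct transcription, with hypothetical function names `supp` and `tail_norm` and a made-up example vector.

```python
import math

def supp(beta, alpha=0.0):
    """Indices j (0-based) with |beta_j| > alpha."""
    return {j for j, b in enumerate(beta) if abs(b) > alpha}

def tail_norm(beta, k, p):
    """r_k^(p)(beta): p-norm of the d - k smallest components in absolute value."""
    tail = sorted((abs(b) for b in beta), reverse=True)[k:]
    if not tail:
        return 0.0
    if p == math.inf:
        return max(tail)
    return sum(x ** p for x in tail) ** (1.0 / p)

# Hypothetical target: two large coefficients plus a few small ones.
beta = [3.0, -0.2, 0.0, 1.5, 0.1]
print(supp(beta, alpha=0.15))      # features above the threshold 0.15
print(tail_norm(beta, k=2, p=1))   # 1-norm of everything outside the top two
print(tail_norm(beta, k=2, p=2))   # 2-norm of everything outside the top two
```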

Consider a target parameter vector $\bar\beta \in R^d$ that is approximately sparse.
Note that we do not assume that $\bar\beta$ is the true target; that is,
$\bar\beta^T x_i$ may not equal $E y_i$. We only assume that this holds
approximately, and we are interested in how well we can estimate $\bar\beta$ using
(1). In particular, the approximation quality (or its closeness to the true
model) of any $\bar\beta$ is measured by
$\|\frac{1}{n}\sum_{i=1}^n (\bar\beta^T x_i - E y_i) x_i\|_\infty$ in our analysis. If
$\bar\beta^T x_i = E y_i$, then the underlying model is
$y_i = \bar\beta^T x_i + \epsilon_i$, where $\epsilon_i$ are independent sub-Gaussian
noises. In the more general case, we only need to assume that
$\|\frac{1}{n}\sum_{i=1}^n (\bar\beta^T x_i - E y_i) x_i\|_\infty$ is small for some
approximate target $\bar\beta$. The relationship of this quantity and least squares
approximation error is discussed in Section 5.

The following theorem holds both in fixed design and in random design.
The only difference is that, in the fixed design situation, we may let
$a = (\sup_j A_{j,j})^{1/2}$, and the condition $(\sup_j A_{j,j})^{1/2} \le a$
automatically holds. In order to simplify the claims, our later results are all
stated under the fixed design assumption. In the following theorem, the statement
of "with probability $1 - \delta$: if X then Y" can also be interpreted as "with
probability $1 - \delta$: either X is false or Y is true." We note that, in
practice, the condition of the theorem can be combinatorially hard to check,
since computing the quantities in Definition 3.1 requires searching over sets of
fixed cardinality.

Theorem 4.1. Let Assumption 4.1 hold, and let $A = \frac{1}{n}\sum_{i=1}^n x_i x_i^T$.
Let $\hat\beta$ be the solution of (1). Consider any fixed target vector
$\bar\beta \in R^d$ and a positive constant $a > 0$. Given $\delta \in (0, 1)$, then,
with probability larger than $1 - \delta$, the following two claims hold for
$q = 1, p$, and all $k, \ell$ such that $k \le \ell \le (d - k)/2$, $t \in (0, 1)$,
$p \in [1, \infty]$:

If $t \le 1 - \pi^{(p)}_{A,k+\ell,\ell} k^{1-1/p} / \ell$,
$\lambda \ge \frac{4(2-t)}{t} \left( \sigma a \sqrt{\frac{2}{n} \ln(2d/\delta)} + \|\frac{1}{n}\sum_{i=1}^n (\bar\beta^T x_i - E y_i) x_i\|_\infty \right)$,
and $(\sup_j A_{j,j})^{1/2} \le a$, then



< 



lg ~ , (P) [H A,k+£ k 
tUJ A,k+£ 



Ey^XjUoo), and (sap j Ajj) 1 ' 2 < a, then 
+ 4r^ 1) C9)€ 1 /9- 1 . 

Although, for a fixed $p$, the above theorem only gives bounds for
$\|\hat\beta - \bar\beta\|_q$ with $q = 1, p$, this information is sufficient to
obtain bounds for general $q \in [1, \infty]$. If $q \in [1, p]$, we can use the
following interpolation rule (which follows from Hölder's inequality), written
with $\Delta\hat\beta = \hat\beta - \bar\beta$:

$$\|\Delta\hat\beta\|_q \le \|\Delta\hat\beta\|_1^{(p-q)/(pq-q)} \|\Delta\hat\beta\|_p^{(pq-p)/(pq-q)},$$

and if $q > p$, we use $\|\Delta\hat\beta\|_q \le \|\Delta\hat\beta\|_p$. Although the
estimate we obtain when $q \ne p$ is typically worse than the bound achieved at
$p = q$ (assuming that the condition of the theorem can be satisfied at $p = q$),
it may still be useful, because the condition for the theorem to apply may be
easier to satisfy for certain $p$.
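
The interpolation rule can be sanity-checked numerically. The sketch below evaluates both sides on random vectors for finite $p > 1$ and $1 \le q \le p$; the function names, the random test vectors and the chosen exponents are assumptions made for this illustration.

```python
import random

def p_norm(v, p):
    """Vector p-norm for finite p >= 1."""
    return sum(abs(x) ** p for x in v) ** (1.0 / p)

def interpolation_bound(v, q, p):
    """Upper bound on ||v||_q built from ||v||_1 and ||v||_p (1 <= q <= p, p > 1)."""
    a = (p - q) / (p * q - q)      # exponent on the 1-norm
    b = (p * q - p) / (p * q - q)  # exponent on the p-norm; a + b * (1/p) = 1/q
    return p_norm(v, 1) ** a * p_norm(v, p) ** b

random.seed(0)
worst = 0.0
for _ in range(1000):
    v = [random.gauss(0.0, 1.0) for _ in range(10)]
    for q, p in ((1.5, 3.0), (2.0, 4.0), (1.2, 2.0)):
        worst = max(worst, p_norm(v, q) / interpolation_bound(v, q, p))
print(worst)   # never exceeds 1 (up to rounding) on these samples
```

At the endpoints the rule is tight: for $q = 1$ the exponents are $(1, 0)$ and for $q = p$ they are $(0, 1)$.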

It is important to note that the first claim is more refined than the second
claim, as it replaces the explicit $\ell$-dependent term $O(\ell^{1/p} \lambda)$ by the
term $O(r_k^{(p)}(\bar\beta))$, which does not explicitly depend on $\ell$. In order to
optimize the bound, we can choose $k = |\mathrm{supp}_\lambda(\bar\beta)|$, which
implies that $O(r_k^{(p)}(\bar\beta)) = O(\ell^{1/p} \lambda + r_{k+\ell}^{(p)}(\bar\beta)) = O(\ell^{1/p} \lambda + \ell^{1/p-1} r_k^{(1)}(\bar\beta))$.
This quantity is dominated by the bound in the second claim. However, if
$O(r_k^{(p)}(\bar\beta))$ is small, then the first claim is much better when $\ell$
is large.

The theorem as stated is not intuitive. In order to obtain a more intuitive
bound from the first claim of the theorem, we consider a special case
with the mutual incoherence condition. The following corollary is a simple
consequence of the first claim of Theorem 4.1 (with $q = p$) and Proposition 3.2.
The result shows that the mutual incoherence condition is a stronger assumption
than the conditions on the quantities that appear in our analysis.

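Corollary 4.1. Let Assumption 4.1 hold, and let
$A = \frac{1}{n}\sum_{i=1}^n x_i x_i^T$, and assume that $A_{j,j} = 1$ for all $j$.
Define $M_A = \sup_{i \ne j} |A_{i,j}|$. Let $\hat\beta$ be the solution of (1).
Consider any fixed target vector $\bar\beta \in R^d$. Given $\delta \in (0, 1)$,
then, with probability larger than $1 - \delta$, the following claim holds for all
$k \le \ell \le (d - k)/2$, $t \in (0, 1)$, $p \in [1, \infty]$: if
$M_A (k + \ell) \le (1 - t)/(2 - t)$ and
$\lambda \ge \frac{4(2-t)}{t} \left( \sigma \sqrt{\frac{2}{n} \ln(2d/\delta)} + \|\frac{1}{n}\sum_{i=1}^n (\bar\beta^T x_i - E y_i) x_i\|_\infty \right)$,
then

$$\|\hat\beta - \bar\beta\|_p \le \frac{8(2-t)}{t} \left[ 1.5 \, r_k^{(p)}(\bar\beta) + k^{1/p} \lambda \right] + 4 r_k^{(p)}(\bar\beta) + \frac{4(8-7t)}{t} r_k^{(1)}(\bar\beta) \ell^{1/p-1}.$$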

The above result is of the form

(2) $$\|\hat\beta - \bar\beta\|_p = O(k^{1/p} \lambda + r_k^{(p)}(\bar\beta) + r_k^{(1)}(\bar\beta) \ell^{1/p-1}),$$

where we can let $k = |\mathrm{supp}_\lambda(\bar\beta)|$, so that $k$ is the number of
components such that $|\bar\beta_j| > \lambda$. Although mutual incoherence is assumed
for simplicity, a similar bound holds for any $p$ if we assume that $A$ is
p-diagonal dominant at block size $k + \ell$. Such an assumption is weaker than
mutual incoherence.

To our knowledge, none of the earlier work on Lasso obtained parameter
estimation bounds in the form of (2). The first two terms in the bound are
what we shall expect from the $L_1$-regularization method (1) and are, thus,
unlikely to be significantly improved (except for the constants). The first term
is the variance term, and the second term is needed because $L_1$ regularization
tends to shrink coefficients $j \notin \mathrm{supp}_\lambda(\bar\beta)$ to zero.
Although it is not clear whether the third term can be improved, we shall note
that it becomes small if we can choose a large $\ell$. Note that, if $\bar\beta$ is
the true parameter, $\bar\beta^T x_i = E y_i$, then we may take
$\lambda = 4(2 - t) t^{-1} \sigma \sqrt{\ln(d/\delta)/n}$ in (2). The bound in [4] has
a similar form (but with $p = 2$), which we will compare in the next section.

Note that, in Theorem 4.1, one can always take $\lambda$ sufficiently large, so
that the condition for $\lambda$ is satisfied. Therefore, in order to apply the
theorem, one needs either the condition
$0 < t \le 1 - \pi^{(p)}_{A,k+\ell,\ell} k^{1-1/p} / \ell$ or the condition
$0 < t \le 1 - \gamma^{(p)}_{A,k+\ell,\ell} k^{1-1/p} / \ell$. They require that small
diagonal blocks of $A$ are not nearly singular. As pointed out after
Proposition 3.1, the condition for the first claim is typically harder to
satisfy. For example, as discussed below, even when $p \ne 2$, the requirement
$\gamma^{(p)}_{A,k+\ell,\ell} k^{1-1/p} / \ell < 1$ can always be satisfied when
diagonal sub-blocks of $A$ at a certain size satisfy some eigenvalue conditions,
while this is not true for the condition
$\pi^{(p)}_{A,k+\ell,\ell} k^{1-1/p} / \ell < 1$.

In the case of $p = 2$, the condition
$0 < 1 - \pi^{(2)}_{A,k+\ell,\ell} k^{0.5} / \ell$ can always be satisfied if the small
diagonal blocks of $A$ have eigenvalues bounded from above and below (away from
zero). We shall refer to such a condition as the sparse eigenvalue condition
(also see [1, 19]). Indeed, Proposition 3.1 implies that
$\pi^{(2)}_{A,k+\ell,\ell} k^{0.5} / \ell \le 0.5 (k/\ell)^{0.5} \sqrt{\rho^{(2)}_{A,\ell} / \mu^{(2)}_{A,k+2\ell} - 1}$.
Therefore, this condition can be satisfied if we can find $\ell$ such that
$\rho^{(2)}_{A,\ell} / \mu^{(2)}_{A,k+2\ell} \le 4\ell/k$. In particular, if
$\rho^{(2)}_{A,\ell} / \mu^{(2)}_{A,k+2\ell} \le c$ for a constant $c > 0$ when
$\ell \le ck$ (which is what we will mean by sparse eigenvalue condition in later
discussions), then one can simply take $\ell = ck$.

For $p > 2$, a similar claim holds for the condition
$0 < 1 - \gamma^{(p)}_{A,k+\ell,\ell} k^{1-1/p} / \ell$.

Proposition 3.1 implies that 7j, , , < 7? , . < 7 , , „ „ < j , „jui, „< 

V^p, J P \ , „■ Sparse eigenvalue condition (bounded eigenvalue) at block 

size of order $k^{2-2/p}$ implies that the condition
$0 < 1 - \gamma^{(p)}_{A,k+\ell,\ell} k^{1-1/p} / \ell$ can be satisfied with an
appropriate choice of $\ell = O(k^{2-2/p})$. Under assumptions that are stronger
than the sparse eigenvalue condition, one can obtain better and simpler results.
For example, this is demonstrated in Corollary 4.1 under the mutual incoherence
condition.

Finally, in order to concretely compare the condition of Theorem 4.1 for
different $p$, we consider the random design situation (with random vectors
$x_i$), where each component $x_{i,j}$ is independently drawn from the standard
Gaussian distribution $N(0, 1)$ ($i = 1, \dots, n$ and $j = 1, \dots, d$). This
situation is investigated in the compressed sensing literature, such as [7]. In
particular, it was shown that, with large probability, the following RIP
inequality holds for some constant $c > 0$ ($s \le n \le d$):
$|\rho^{(2)}_{A,s} - 1| + |\mu^{(2)}_{A,s} - 1| \le c \sqrt{s \ln d / n}$. Now,

for p>2, using Proposition 3.1, it is not hard to show that (we shall skip the 
detailed derivation here because it is not essential to the main point of this 



paper), for k <£, ef < 4c£ y / EJ/^ and J A ] k+i > 1 - 2c^J £ 2 - 2 / p \nd/ 



in. 



Therefore, the condition $0.5 \le 1 - \gamma^{(p)}_{A,k+\ell,\ell} k^{1-1/p} / \ell$ in
Theorem 4.1 holds with large probability, as long as
$n \ge 256 c^2 \ell^{2-2/p} \ln d$. Therefore, in order to apply the theorem with
fixed $k \le \ell$ and $d$, the larger $p$ is, the larger the sample size $n$ has to
be. In comparison, the mutual incoherence condition of Corollary 4.1 is satisfied
when $n \ge c' \ell^2 \ln d$ for some constant $c' > 0$.
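
The dependence of the sample size requirement on $p$ can be tabulated directly. In the sketch below the constants `c` and `c_prime` are unknown in the text and are set to 1 purely as placeholders, so only the scaling in $\ell$ and $p$ is meaningful, not the absolute numbers. The function names and the values of $\ell$ and $d$ are also assumptions.

```python
import math

def n_sparse_condition(ell, d, p, c=1.0):
    """Threshold 256 * c^2 * ell^(2 - 2/p) * ln(d), stated in the text for p >= 2."""
    return 256.0 * c ** 2 * ell ** (2.0 - 2.0 / p) * math.log(d)

def n_mutual_incoherence(ell, d, c_prime=1.0):
    """Threshold c' * ell^2 * ln(d) for the mutual incoherence condition."""
    return c_prime * ell ** 2 * math.log(d)

ell, d = 20, 10_000
for p in (2.0, 4.0, math.inf):
    print(p, round(n_sparse_condition(ell, d, p)))
print("mutual incoherence:", round(n_mutual_incoherence(ell, d)))
```

For $p = 2$ the threshold grows linearly in $\ell$, while for $p = \infty$ the exponent $2 - 2/p$ equals 2, the same power of $\ell$ as in the mutual incoherence requirement.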

5. Noise and approximation error. In Theorem 4.1 and Corollary 4.1,
we do not assume that $\bar\beta$ is the true parameter that generates $E y_i$. The
bounds depend on the quantity
$\|\frac{1}{n}\sum_{i=1}^n (\bar\beta^T x_i - E y_i) x_i\|_\infty$ to measure how
close $\bar\beta$ is to the true parameter. This quantity may be regarded
as an algebraic definition of noise, in that it behaves like stochastic noise.

The following proposition shows that, if the least squares error achieved
by $\bar\beta$ (which is often called approximation error) is small, then the
algebraic noise level (as defined above) is also small. However, the reverse is
not true. For example, for the identity design matrix and
$\beta_*^T x_i = E y_i$, the algebraic noise is
$\|\sum_{i=1}^n (\bar\beta^T x_i - E y_i) x_i\|_\infty = \|\bar\beta - \beta_*\|_\infty$,
but the least squares approximation error is $\|\bar\beta - \beta_*\|_2$. Therefore,
in general, algebraic noise can be small, even when the least squares
approximation error is large.
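
The gap between the two quantities in the identity design example can be seen numerically. The sketch below assumes a design scaled so that $\frac{1}{n}\sum_i x_i x_i^T$ is the identity (here $x_i = \sqrt{n}\, e_i$ with $n = d$), which is one reading of the example above, and uses the normalized definitions from this section. The function names and the particular vectors are hypothetical.

```python
import math

def algebraic_noise(X, beta_bar, mean_y):
    """|| (1/n) * sum_i (beta_bar^T x_i - E y_i) x_i ||_infinity."""
    n, d = len(X), len(X[0])
    resid = [sum(b * x for b, x in zip(beta_bar, row)) - m
             for row, m in zip(X, mean_y)]
    return max(abs(sum(r * row[j] for r, row in zip(resid, X)) / n)
               for j in range(d))

def approximation_error(X, beta_bar, mean_y):
    """( (1/n) * sum_i (beta_bar^T x_i - E y_i)^2 )^(1/2)."""
    n = len(X)
    total = sum((sum(b * x for b, x in zip(beta_bar, row)) - m) ** 2
                for row, m in zip(X, mean_y))
    return math.sqrt(total / n)

n = d = 100
X = [[math.sqrt(n) if i == j else 0.0 for j in range(d)] for i in range(n)]
beta_star = [0.0] * d                                  # generates E y_i
mean_y = [sum(b * x for b, x in zip(beta_star, row)) for row in X]
beta_bar = [0.1] * d                                   # dense but small deviation

print(algebraic_noise(X, beta_bar, mean_y))      # equals the infinity-norm of the deviation
print(approximation_error(X, beta_bar, mean_y))  # equals the 2-norm of the deviation
```

With a dense deviation of size 0.1 in every coordinate, the 2-norm is larger than the infinity-norm by a factor of $\sqrt{d}$.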

Proposition 5.1. Let $A = \frac{1}{n}\sum_{i=1}^n x_i x_i^T$ and
$a = (\sup_j A_{j,j})^{1/2}$. Given $k > 0$, there exists
$\bar\beta^{(k)} \in R^d$ such that

\ 1/2 



1 " 

-^(^x.-EyOx, 



n i 




i=i 



supp (^ (fc) -P) < k, and \\^-Ph < 2(n- 1 Y^i^-^i) 2 ) 1/2 /Jf*f k - 



This proposition can be combined with Theorem 4.1 or Corollary 4.1 to
derive bounds in terms of the approximation error
$n^{-1} \sum_{i=1}^n (\bar\beta^T x_i - E y_i)^2$ instead of the algebraic noise
$\|\frac{1}{n}\sum_{i=1}^n (\bar\beta^T x_i - E y_i) x_i\|_\infty$. For example, as a
simple consequence of Corollary 4.1, we have the following bound. A similar
but less general result [with $\sigma = 0$ and $|\mathrm{supp}_0(\bar\beta)| = k$] was
presented in [9].

Corollary 5.1. Let Assumption 4.1 hold, let
$A = \frac{1}{n}\sum_{i=1}^n x_i x_i^T$ and assume that $A_{j,j} = 1$ for all $j$.
Define $M_A = \sup_{i \ne j} |A_{i,j}|$. Let $\hat\beta$ be the solution of (1).
Consider any fixed target vector $\bar\beta \in R^d$. Given $\delta \in (0, 1)$,
then, with probability larger than $1 - \delta$, the following claim holds for all
$2k \le \ell \le (d - 2k)/2$, $t \in (0, 1)$, $p \in [1, \infty]$. If
$M_A (2k + \ell) \le (1 - t)/(2 - t)$ and

A > ^^{(y\]l ln(2d/<5) + e/y/k + T), then 

\\&-Ph< ^f^[l-5r£ ,> 05) + (2^) 1/2 A] +4rf (^) 
+ 4(8-7t) r ( 1)( ^_ 1/2 + 4£) 

where $\epsilon = (n^{-1} \sum_{i=1}^n (\bar\beta^T x_i - E y_i)^2)^{1/2}$.

A similar result holds under the sparse eigenvalue condition. We should
point out that, in $L_1$ regularization, the behavior of stochastic noise
($\sigma > 0$) is similar to that of the algebraic noise introduced above, but it
is very different from the least squares approximation error $\epsilon$. In
particular, the so-called bias of $L_1$ regularization shows up in the stochastic
noise term but not in the least squares approximation error term. If we set
$\sigma = 0$ but $\epsilon \ne 0$, our analysis of the two-stage procedure in
Section 8 will not improve that of the standard Lasso given in Corollary 5.1,
simply because the two-stage procedure does not improve the term involving the
approximation error $\epsilon$. However, the benefit of the two-stage procedure
clearly shows up in the stochastic noise term. For this reason, it is important to
distinguish the true stochastic noise and the approximation error $\epsilon$, and
to develop analysis that includes both stochastic noise and approximation error.

6. Dantzig selector versus $L_1$ regularization. Recently, Candes and Tao
proposed an estimator, called the Dantzig selector in [4], and proved a very
strong performance bound for this new method. However, it was observed in
[8, 10] and [12] that the performance of Lasso is comparable to that of the
Dantzig selector. Consequently, the authors of [10] asked whether a performance
bound similar to that of the Dantzig selector holds for Lasso as well. In this
context, we observe that a simple but important consequence of the first
claim of Theorem 4.1 leads to a bound for $L_1$ regularization that reproduces
the main result for the Dantzig selector in [4]. We restate the result below,
which provides an affirmative answer to the above mentioned open question of [10].

Corollary 6.1. Let Assumption 4.1 hold, and let
$A = \frac{1}{n}\sum_{i=1}^n x_i x_i^T$ and $a = (\sup_j A_{j,j})^{1/2}$. Consider the
true target vector $\bar\beta$ such that $E y = \bar\beta^T x$. Let $\hat\beta$ be the
solution of (1). Given $\delta \in (0, 1)$, then, with probability larger than
$1 - \delta$, the following claim holds for all $(d - k)/2 \ge \ell \ge k$. If
$t = 1 - \pi^{(2)}_{A,k+\ell,\ell} k^{0.5} / \ell > 0$ and
$\lambda \ge 4(2 - t) t^{-1} \sigma a \sqrt{2 \ln(2d/\delta)/n}$, then

Corollary 6.1 is directly comparable to the main result of [4] for the
Dantzig selector, which is given by the estimator

$$\hat\beta_D = \arg\min_{\beta \in R^d} \|\beta\|_1 \quad \text{subject to} \quad \sup_j \left| \sum_{i=1}^n x_{i,j} (x_i^T \beta - y_i) \right| \le b_D.$$



Their main result is stated below, in Theorem 6.1. It uses a different quantity
$\tilde\theta_{A,k,\ell}$, which is defined as

$$\tilde\theta_{A,k,\ell} = \sup_{\beta \in R^\ell, I, J} \frac{\|A_{I,J} \beta\|_2}{\|\beta\|_2}$$

using notation of Definition 3.1. It is easy to see that
$\theta^{(2)}_{A,k,\ell} \le \tilde\theta_{A,k,\ell} \sqrt{\ell}$.



Theorem 6.1 [4]. Assume that there exists a vector $\bar\beta \in R^d$ with $s$
nonzero components, such that $y_i = \bar\beta^T x_i + \epsilon_i$, where
$\epsilon_i \sim N(0, \sigma^2)$ are i.i.d. Gaussian noises. Let
$A = \frac{1}{n}\sum_{i=1}^n x_i x_i^T$, and assume that $A_{j,j} \le 1$ for all $j$.
Given $t_D > 0$ and $\delta \in (0, 1)$, we set $b_D = \sqrt{n} \lambda_D \sigma$, with



\ D = (^/l- (ln5 + ln(Vvrln(i))/lnd + ^ 1 )V21nrf. 

Let h,2s = max ^i,L ~ X ' 1 ~ f%^- % ®A,2s + ®A,2s,s < 1 ~ t D, then, with 
probability exceeding 1 — 5, 

WfiD -M< c^h^A^l^^ + 4 2) CP) 2 )- 

The quantity C 2 (a,b) is defined as C 2 (a,6) = 2 ^°^-b + (i-a-6)^ + i-t-6 » 
w/iere C (a, 6) = 2>/2(l + I i f^ E ) + (1 + l/\/2)(l + a) 2 /(l - a - b). 

In order to see that Corollary 6.1 is comparable to Theorem 6.1, we shall
compare their conditions and consequences. To this end, we can pick any
$\ell \in [k, s]$ in Corollary 6.1.

We shall first look at the conditions. The condition required in Theo- 

~ (2) 

rem 6.1 is 9^ 2s + 9^ 2 < 1 - t D , which implies that 9 A2ss < fiy - to- 

This condition is stronger than 9^\ +i J Vk < 9 ^ k+e e < ^ — tn, which 
implies the condition t > in Corollary 6.1: 



( 2 ) e t + ^K, ^Z + e-^ k 



n A ,k+e,e £ - (2) f * (2) f 

-P-V0+ >o. 

^A,k+t 

Therefore, Corollary 6.1 can be applied, as long as Theorem 6.1 can be
applied. Moreover, as long as $t_D > 0$, $t$ is never much smaller than $t_D$ but
can be significantly larger (e.g., $t > 0.5$ when $k < \ell/2$, even if $t_D$ is
close to zero). It is also obvious that the condition $t > 0$ does not imply that
$t_D > 0$. Therefore, the condition of Corollary 6.1 is strictly weaker. As
discussed in Section 4, if $\rho^{(2)}_{A,\ell} / \mu^{(2)}_{A,k+2\ell} \le c$ for a
constant $c > 0$ when $\ell \le ck$, then the condition $t > 0$ holds with
$\ell = ck$.

Next, we shall look at the consequences of the two theorems when both
$t > 0$ and $t_D > 0$. Ignoring constants, the bound in Theorem 6.1, with
$\lambda_D = O(\sqrt{\ln(d/\delta)})$, can be written as

$$\|\hat\beta_D - \bar\beta\|_2 = O(\sqrt{\ln(d/\delta)}) (r_k^{(2)}(\bar\beta) + \sigma \sqrt{k/n}).$$

However, the proof itself implies a stronger bound of the form

(3) $$\|\hat\beta_D - \bar\beta\|_2 = O(\sigma \sqrt{k \ln(d/\delta)/n} + r_k^{(2)}(\bar\beta)).$$

In comparison, in Corollary 6.1, we can pick
$\lambda = O(\sigma \sqrt{\ln(d/\delta)/n})$, and then the bound can be written as
(with $\ell = s$)

(4) $$\|\hat\beta - \bar\beta\|_2 = O(\sigma \sqrt{k \ln(d/\delta)/n} + r_k^{(2)}(\bar\beta) + r_k^{(1)}(\bar\beta)/\sqrt{s}).$$

Note that we do not have to assume that $\bar\beta$ only contains $s$ nonzero
components. The quantity $r_k^{(1)}(\bar\beta)/\sqrt{s}$ is no more than
$r_k^{(2)}(\bar\beta)$ under the sparsity of $\bar\beta$ assumed in Theorem 6.1. It is
thus clear that (4) has a more general form than that of (3).

It was pointed out in [1] that Lasso and the Dantzig selector are quite
similar, and the authors presented a simultaneous analysis of both. Since the
explicit parameter estimation bounds in [1] are for the case
$k = |\mathrm{supp}_0(\bar\beta)|$, it is natural to ask whether our results (in
particular, the first claim of Theorem 4.1) can also be applied to the Dantzig
selector, so that a simultaneous analysis similar to that of [1] can be
established. Unfortunately, the techniques used in this paper do not immediately
give an affirmative answer. This is because a Lasso-specific property is used in
our proof of Lemma 10.4, and the property does not hold for the Dantzig selector.
However, we conjecture that it may still be possible to prove similar results for
the Dantzig selector through different techniques such as those employed in [4]
and [6].

7. Feature selection through coefficient thresholding. A fundamental
result of $L_1$ regularization is its feature selection consistency property, which is
considered in [16] and more formally analyzed in [20]. It was shown that,
under a strong irrepresentable condition (introduced in [20]) together with the
sparse eigenvalue condition, the set $\mathrm{supp}_0(\hat\beta)$, with $\hat\beta$ estimated using Lasso,
may be able to consistently identify features with coefficients larger than a
threshold of order $\sqrt{k}\lambda$ (with $\lambda = O(\sigma\sqrt{\ln(d/\delta)/n})$). Here, $k$ is the sparsity
of the true target. That is, with probability $1 - \delta$, all coefficients larger than
a certain threshold of the order $O(\sigma\sqrt{k\ln(d/\delta)/n})$ remain nonzero, while
all zero coefficients remain zero. It was also shown that a slightly weaker
irrepresentable condition is necessary for Lasso to possess this property. For
Lasso, the $\sqrt{k}$ factor cannot be removed (under the sparse eigenvalue
assumption plus the irrepresentable condition) unless additional conditions
(such as the mutual incoherence assumption in Corollary 7.1) are imposed.
Also, see [5, 18] for related results without the $\sqrt{k}$ factor.

It was acknowledged in [19] and [17] that the irrepresentable condition 
can be quite strong (e.g., often more restricted than eigenvalue conditions 
required for Corollary 6.1). This is the motivation of the sparse eigenvalue 
condition introduced in [19], although such a condition does not necessarily 
yield consistent feature selection under the scheme of [20], which employs the
set $\mathrm{supp}_0(\hat\beta)$ to identify features. However, limitations of the irrepresentable
condition can be removed by considering $\mathrm{supp}_\alpha(\hat\beta)$ with $\alpha > 0$.

In this section, we consider a more extended view of feature selection,
where a practitioner would like to find relevant features with coefficient
magnitude larger than some threshold $\alpha$ that is not necessarily zero. Features
with small coefficients are regarded as irrelevant features, which are not
distinguished from zero for practical purposes. The threshold $\alpha$ can be
pre-chosen based on the interests of the practitioner as well as our knowledge
of the underlying problem. We are interested in the relationship of features
estimated from the solution $\hat\beta$ of (1) and the true relevant features obtained
from $\bar\beta$. The following result is a simple consequence of Theorem 4.1, where
we use $\|\hat\beta - \bar\beta\|_p$ for some large $p$ to approximate $\|\hat\beta - \bar\beta\|_\infty$ (which is needed
for feature selection). A consequence of the result (see Corollary 7.2) is that,
using a nonzero threshold $\alpha$ (rather than the zero-threshold of [20]), it is possible
to achieve consistent feature selection even if the irrepresentable condition
in [20] is violated. For clarity, we choose a simplified statement with sparse
target $\bar\beta$. However, it is easy to see from the proof that, just as in Theorem
4.1, a similar but more complicated statement holds, even when the target
is not sparse.

Theorem 7.1. Let Assumption 4.1 hold, and let $\hat A = \frac{1}{n}\sum_{i=1}^n x_i x_i^T$ and
$a = (\sup_j \hat A_{j,j})^{1/2}$. Let $\bar\beta \in R^d$ be the true target vector with $Ey = \bar\beta^T x$, and
assume that $|\mathrm{supp}_0(\bar\beta)| = k$. Let $\hat\beta$ be the solution of (1). Given $\delta \in (0,1)$,
then, with probability larger than $1 - \delta$, the following claim is true. For all
$\epsilon \in (0,1)$, if there exist $(d - k)/2 \ge \ell \ge k$, $t \in (0,1)$, and $p \in [1,\infty]$ so that:

• $\lambda \ge 4(2 - t)t^{-1}(\sigma a\sqrt{2\ln(2d/\delta)/n})$;
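• either $8(\epsilon\alpha\,\omega^{(p)}_{\hat A,k+\ell})^{-1}k^{1/p}\lambda \le t \le 1 - \pi^{(p)}_{\hat A,k+\ell,\ell}k^{1-1/p}/\ell$, or
  $8(\epsilon\alpha\,\mu^{(p)}_{\hat A,k+\ell})^{-1}(k+\ell)^{1/p}\lambda \le t \le 1 - \gamma^{(p)}_{\hat A,k+\ell,\ell}k^{1-1/p}/\ell$;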

then $\mathrm{supp}_{(1+\epsilon)\alpha}(\bar\beta) \subset \mathrm{supp}_\alpha(\hat\beta) \subset \mathrm{supp}_{(1-\epsilon)\alpha}(\bar\beta)$.
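
The sandwich inclusion in the conclusion of the theorem is easy to test on concrete vectors. The sketch below assumes that $\mathrm{supp}_\alpha(\beta)$ means the index set $\{j : |\beta_j| > \alpha\}$; the function names are hypothetical. It also illustrates why a sup-norm estimation error of at most $\epsilon\alpha$ is enough for the inclusion to hold.

```python
import numpy as np

def supp(beta, alpha):
    """Indices j with |beta_j| > alpha (alpha = 0 gives the usual support)."""
    return set(np.flatnonzero(np.abs(np.asarray(beta, dtype=float)) > alpha).tolist())

def sandwich_holds(beta_hat, beta_bar, alpha, eps):
    """Check supp_{(1+eps)alpha}(beta_bar) <= supp_alpha(beta_hat) <= supp_{(1-eps)alpha}(beta_bar)."""
    lower = supp(beta_bar, (1 + eps) * alpha)
    upper = supp(beta_bar, (1 - eps) * alpha)
    return lower <= supp(beta_hat, alpha) <= upper   # set inclusion

# If every coordinate of beta_hat is within eps*alpha of beta_bar, the inclusion holds.
rng = np.random.default_rng(0)
alpha, eps = 1.0, 0.5
beta_bar = rng.normal(scale=2.0, size=30)
beta_hat = beta_bar + rng.uniform(-eps * alpha, eps * alpha, size=30)
print(sandwich_holds(beta_hat, beta_bar, alpha, eps))
```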

If either $t \le 1 - \pi^{(p)}_{\hat A,k+\ell,\ell}k^{1-1/p}/\ell$ or $t \le 1 - \gamma^{(p)}_{\hat A,k+\ell,\ell}k^{1-1/p}/\ell$, then the result
can be applied as long as $\alpha$ is sufficiently large. As we have pointed out
after Theorem 4.1, if the sparse eigenvalue condition holds at block size
of order $k^{2-2/p}$ for some $p \ge 2$, then one can take $\ell = O(k^{2-2/p})$, so that
the condition $t \le 1 - \gamma^{(p)}_{\hat A,k+\ell,\ell}k^{1-1/p}/\ell$ is satisfied. This implies that we may
take $\alpha = O(k^{2/p-2/p^2}\lambda) = O(\sigma k^{2/p-2/p^2}\sqrt{\ln(d/\delta)/n})$, assuming that $\mu^{(p)}_{\hat A,k+\ell}$
is bounded from below (which holds when $\hat A$ is $p$-norm diagonal dominant at
size $k + \ell$, according to Proposition 3.1). That is, the sparse eigenvalue condition
at a certain block size of order $k^{2-2/p}$, together with the boundedness of
$\mu^{(p)}_{\hat A,k+\ell}$, imply that one can distinguish coefficients of magnitude larger than
a threshold of order $\sigma k^{2/p-2/p^2}\sqrt{\ln(d/\delta)/n}$ from zero. In particular, if $p = \infty$,
we can distinguish nonzero coefficients of order $\sigma\sqrt{\ln(d/\delta)/n}$ from zero. For
simplicity, we state such a result under the mutual incoherence assumption.

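Corollary 7.1. Let Assumption 4.1 hold, let $\hat A = \frac{1}{n}\sum_{i=1}^n x_i x_i^T$ and
assume that $\hat A_{j,j} = 1$ for all $j$. Define $M_{\hat A} = \sup_{i \ne j}|\hat A_{i,j}|$. Let $\hat\beta$ be the
solution of (1). Let $\bar\beta \in R^d$ be the true target vector with $Ey = \bar\beta^T x$ and $k =
|\mathrm{supp}_0(\bar\beta)|$. Assume that $kM_{\hat A} \le 0.25$ and $3k \le d$. Given $\delta \in (0,1)$, with
probability larger than $1 - \delta$, if $\alpha/32 > \lambda \ge 12\sigma\sqrt{2\ln(2d/\delta)/n}$, then $\mathrm{supp}_{(1+\epsilon)\alpha}(\bar\beta) \subset
\mathrm{supp}_\alpha(\hat\beta) \subset \mathrm{supp}_{(1-\epsilon)\alpha}(\bar\beta)$, where $\epsilon = 32\lambda/\alpha$.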

One can also obtain a formal result on the asymptotic consistency of
feature selection. An example is given below. In the description, we allow the
problem to vary with sample size $n$, and study the asymptotic behavior when
$n \to \infty$. Therefore, except for the input vectors $x_i$, all other quantities such
as $d$, $\beta$, etc., will be denoted with subscript $n$. The input vectors $x_i \in R^{d_n}$
also vary with $n$; however, we drop the subscript $n$ to simplify the notation.
The statement of our result is in the same style as a corresponding
asymptotic feature selection consistency theorem of [20] for the zero-thresholding
scheme $\mathrm{supp}_0(\hat\beta)$, which requires the stronger irrepresentable condition in
addition to the sparse eigenvalue condition. In contrast, our result employs
nonzero thresholding $\mathrm{supp}_{\alpha_n}(\hat\beta)$, with an appropriately chosen sequence of
decreasing $\alpha_n$; the result only requires the sparse eigenvalue condition (and,
for clarity, we only consider $p = 2$ instead of general $p$ discussed above)
without the need for the irrepresentable condition.

Corollary 7.2. Consider regression problems indexed by the sample
size $n$, and let the corresponding true target vector be $\bar\beta_n = [\bar\beta_{n,1}, \ldots, \bar\beta_{n,d_n}] \in
R^{d_n}$, where $Ey = \bar\beta_n^T x$. Let Assumption 4.1 hold, with $\sigma$ independent of $n$.
Assume that there exists $a > 0$ that is independent of $n$, such that $\frac{1}{n}\sum_{i=1}^n x_{i,j}^2 \le
a^2$ for all $j$. Denote, by $\hat\beta_n$, the solution of (1) with $\lambda = 12\sigma a\sqrt{2(\ln(2d_n) + n^{s'})/n}$,
where $s' \in (0,1)$. Pick $s \in (0, 1 - s')$, and set $\alpha_n = n^{-s/2}$. Then, as $n \to
\infty$, $P(\mathrm{supp}_{\alpha_n}(\hat\beta_n) \ne \mathrm{supp}_0(\bar\beta_n)) = O(\exp(-n^{s}))$ if the following conditions
hold:

1. $\bar\beta_n$ only has $k_n = o(n^{1-s-s'})$ nonzero coefficients;

2. $k_n \ln(d_n) = o(n^{1-s})$;

3. $1/\min_{j \in \mathrm{supp}_0(\bar\beta_n)} |\bar\beta_{n,j}| = o(n^{s/2})$;

4. Let $\hat A_n = \frac{1}{n}\sum_{i=1}^n x_i x_i^T \in R^{d_n \times d_n}$. There exists a positive integer $q_n$ such
that (1 + 2 Qn )k n < d n , l/»f n>(1+2qn)kn = 0(1), and pf^^ < d + 

The conditions of the corollary are all standard. Similar conditions have
also appeared in [20]. The first condition simply requires $\bar\beta_n$ to be sufficiently
sparse; if $k_n$ is of the same order as $n$, then one cannot obtain meaningful
consistency results. The second condition requires that $d_n$ is not too large, and,
in particular, that it should be sub-exponential in $n$; otherwise, our analysis
does not lead to consistency. The third condition requires that $|\bar\beta_{n,j}|$ be
sufficiently large when $j \in \mathrm{supp}_0(\bar\beta_n)$. In particular, the condition implies that
each feature component $|\bar\beta_{n,j}|$ needs to be larger than the 2-norm noise level
$\sigma\sqrt{k_n\ln(d_n)/n}$. If some component $\bar\beta_{n,j}$ is too small, then we cannot
distinguish it from the noise. Note that, since the 2-norm parameter estimation
bound is used here, we have a $\sqrt{k_n}$ factor in the noise level. Under stronger
conditions, such as mutual incoherence, this $\sqrt{k_n}$ factor can be removed (as
shown in Corollary 7.1). Finally, the fourth condition is the sparse
eigenvalue assumption; it can also be replaced by some other conditions (such as
mutual incoherence). In comparison, [20] employed the zero-threshold scheme
with $\alpha_n = 0$; therefore, in addition to our assumptions, the irrepresentable
condition is also required.



8. Two-stage $L_1$ regularization with selective penalization. We shall
refer to the feature components corresponding to the large coefficients as
relevant features and the feature components smaller than an appropriately
defined cut-off threshold $\alpha$ as irrelevant features. Theorem 7.1 implies that
Lasso can be used to approximately identify the set of relevant features
$\mathrm{supp}_\alpha(\bar\beta)$. This property can be used to improve the standard Lasso. In this
context, we observe that as an estimation method, $L_1$ regularization has two
important properties, which are as follows:

1. Shrink estimated coefficients corresponding to irrelevant features toward 
zero; 

2. Shrink estimated coefficients corresponding to relevant features toward 
zero. 

While the first effect is desirable, the second effect is not. In fact, we should
avoid shrinking the coefficients corresponding to the relevant features if we
can identify these features. In this case, the standard $L_1$ regularization may
have sub-optimal performance. In order to improve $L_1$, we observe that,
under appropriate conditions such as those of Theorem 7.1, estimated
coefficients corresponding to relevant features tend to be larger than estimated
coefficients corresponding to irrelevant features. Therefore, after the first
stage of $L_1$ regularization, we can identify the relevant features by picking
the components corresponding to the largest coefficients. Those coefficients
are over-shrunk in the first stage. This problem can be fixed by applying
a second stage of $L_1$ regularization, where we do not penalize the features
selected in the first stage. The procedure is described in Figure 1. Its overall
effect is to "unshrink" coefficients of relevant features identified in the first
stage. In practice, instead of tuning $\alpha$, we may also let $\mathrm{supp}_\alpha(\hat\beta)$ contain
exactly $q$ elements, and simply tune the integer valued $q$. The parameters
can then be tuned by cross-validation in sequential order: first, find $\lambda$ to
optimize stage 1 prediction accuracy; second, find $q$ to optimize stage 2
prediction accuracy. If cross-validation works well, then this tuning method
ensures that the two-stage selective penalization procedure is never much
worse than the one-stage procedure in practice, because they are equivalent
with $q = 0$. However, under the right conditions, we can prove a much better
bound for this two-stage procedure, as shown in Theorem 8.1.

A related method, called relaxed Lasso, was proposed recently by Mein- 
shausen [15], which is similar to a two-stage Dantzig selector in [4] (also see 
[12] for a more detailed study). Their idea differs from our proposal in that, 
in the second stage, the parameter coefficients $\beta_j$ are forced to be zero when
$j \notin \mathrm{supp}_0(\hat\beta)$. It was pointed out in [15] that, if $\mathrm{supp}_0(\hat\beta)$ can exactly identify
all nonzero components of the target vector, then, in the second stage, the 
relaxed Lasso can asymptotically remove the bias in the first-stage Lasso. 
However, it is not clear what theoretical result can be stated when Lasso 
cannot exactly identify all relevant features. In the general case, it is not 
easy to ensure that relaxed Lasso does not degrade the performance when 
some relevant coefficients become zero in the first stage. On the contrary, 
the two-stage selective penalization procedure in Figure 1 does not require 
that all relevant features are identified. Consequently, we are able to prove 
a result for Figure 1 with no counterpart for relaxed Lasso. For clarity, the 



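Tuning parameters: $\lambda$, $\alpha$

Input: training data $(x_1, y_1), \ldots, (x_n, y_n)$

Output: parameter vector $\hat\beta'$

• Stage 1. Compute $\hat\beta$ using (1).

• Stage 2. Solve the selective penalization problem

$$\hat\beta' = \arg\min_{\beta \in R^d}\left[\frac{1}{n}\sum_{i=1}^n (\beta^T x_i - y_i)^2 + \lambda \sum_{j \notin \mathrm{supp}_\alpha(\hat\beta)} |\beta_j|\right].$$

Fig. 1. Two-stage $L_1$ regularization with selective penalization.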
result is stated under similar conditions to those of the Dantzig selector in
Theorem 6.1 with sparse targets and $p = 2$ only. Both restrictions can be
easily removed, with a more complicated version of Theorem 7.1, to deal
with nonsparse targets (which can be easily obtained from Theorem 4.1) as
well as the general form of Lemma 10.4, which allows $p \ne 2$.
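
The two-stage procedure of Figure 1 can be sketched in a few lines. This is a minimal illustration, not the solver used in the paper: it assumes the objective $\frac{1}{n}\sum_i(\beta^T x_i - y_i)^2 + \lambda\sum_j w_j|\beta_j|$ with weights $w_j \in \{0,1\}$ and minimizes it by plain proximal gradient descent (ISTA), which is our choice of algorithm. The function names `weighted_lasso` and `two_stage_lasso` are hypothetical, and the design matrix is assumed to be nonzero.

```python
import numpy as np

def weighted_lasso(X, y, lam, weights, n_iter=2000):
    """Minimize (1/n)||X b - y||^2 + lam * sum_j weights[j] * |b_j| by ISTA."""
    n, d = X.shape
    lipschitz = 2.0 * np.linalg.norm(X, 2) ** 2 / n   # Lipschitz constant of the gradient
    step = 1.0 / lipschitz
    beta = np.zeros(d)
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ beta - y) / n
        z = beta - step * grad
        # Soft-thresholding; a zero weight leaves that coordinate unpenalized.
        beta = np.sign(z) * np.maximum(np.abs(z) - step * lam * weights, 0.0)
    return beta

def two_stage_lasso(X, y, lam, alpha):
    """Stage 1: standard Lasso. Stage 2: do not penalize features with |beta_j| > alpha."""
    d = X.shape[1]
    beta_stage1 = weighted_lasso(X, y, lam, np.ones(d))
    selected = np.abs(beta_stage1) > alpha
    beta_stage2 = weighted_lasso(X, y, lam, np.where(selected, 0.0, 1.0))
    return beta_stage1, beta_stage2, selected

# Toy orthogonal design: (1/n) X^T X is the identity, so the Lasso is soft-thresholding.
n = 4
X = np.sqrt(n) * np.eye(n)
y = np.sqrt(n) * np.array([3.0, 0.2, 0.0, -2.0])
b1, b2, sel = two_stage_lasso(X, y, lam=1.0, alpha=1.0)
print(b1, b2, sel)
```

On this toy design the first stage shrinks the two large coefficients by $\lambda/2$, and the second stage restores them to their least squares values while the small coefficients stay at zero, which is the "unshrinking" effect described above.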

Theorem 8.1. Let Assumption 4.1 hold, and let $\hat A = \frac{1}{n}\sum_{i=1}^n x_i x_i^T$ and
$a = (\sup_j \hat A_{j,j})^{1/2}$. Consider any target vector $\bar\beta \in R^d$ such that $Ey = \bar\beta^T x$,
and $\bar\beta$ contains only $s$ nonzeros. Let $k = |\mathrm{supp}_\lambda(\bar\beta)|$. Consider the two-stage
selective penalization procedure in Figure 1. Given $\delta \in (0, 0.5)$, with
probability larger than $1 - 2\delta$, for all $(d - s)/2 \ge \ell \ge s$ and $t \in (0,1)$, assume the
following:

• $0.5\alpha \ge \lambda \ge 4(2 - t)t^{-1}\sigma a\sqrt{2\ln(2d/\delta)/n}$;

• either \§(aJf f ) -1 s 1/p A < t < 1— c /~ ltv l£, or 16(a^ ( ? } J" 1 (s + 
Then, 



11/3' - Ph < -^—l^k+fk^) + + a < 1 + \l™HV5))\l<l/n] 

+ 8rf(/3), 
where $q = |\mathrm{supp}_{1.5\alpha}(\bar\beta)|$.

Again, we include a simplification of Theorem 8.1 under the mutual inco- 
herence condition. 

Corollary 8.1. Let Assumption 4.1 hold, let $\hat A = \frac{1}{n}\sum_{i=1}^n x_i x_i^T$ and
assume that $\hat A_{j,j} = 1$ for all $j$. Define $M_{\hat A} = \sup_{i \ne j}|\hat A_{i,j}|$. Consider any target
vector $\bar\beta$ such that $Ey = \bar\beta^T x$, and assume that $\bar\beta$ contains only $s$ nonzeros
where $s \le d/3$ and assume that $M_{\hat A}s \le 1/6$. Let $k = |\mathrm{supp}_\lambda(\bar\beta)|$. Consider
the two-stage selective penalization procedure in Figure 1. Given $\delta \in (0, 0.5)$,
with probability larger than $1 - 2\delta$, if $\alpha/48 \ge \lambda \ge 12\sigma\sqrt{2\ln(2d/\delta)/n}$, then



11/3' -P\\ 2 < 2A^l~~ q \ + 2Aa[l + ^ 2 ^- ln(l/5) ) + IG&Sf 09), 
where $q = |\mathrm{supp}_{1.5\alpha}(\bar\beta)|$.



Theorem 8.1 can significantly improve the corresponding one-stage result
(see Corollary 6.1 and Theorem 6.1) when $r^{(2)}_k(\bar\beta) \ll \sqrt{k}\lambda$ and $k - q \ll k$. The
latter condition is true when $|\mathrm{supp}_{1.5\alpha}(\bar\beta)| \approx |\mathrm{supp}_\lambda(\bar\beta)|$. In such a case, we
can identify most features in $\mathrm{supp}_\lambda(\bar\beta)$. These conditions are satisfied when
most nonzero coefficients in $\mathrm{supp}_\lambda(\bar\beta)$ are relatively large in magnitude and
the other coefficients are small (in 2-norm). That is, the two-stage procedure
is superior when the target $\bar\beta$ can be decomposed as a sparse vector with
large coefficients plus another (less sparse) vector with small coefficients.
In the extreme case, when $r^{(2)}_k(\bar\beta) = 0$ and $q = k$, we obtain $\|\hat\beta' - \bar\beta\|_2 =
O(\sqrt{k\ln(1/\delta)/n})$ instead of $\|\hat\beta - \bar\beta\|_2 = O(\sqrt{k\ln(d/\delta)/n})$ for the one-stage
Lasso. The difference can be significant when $d$ is large.

Finally, we shall point out that the two-stage selective penalization
procedure may be regarded as a two-step approximation to solving the least
squares problem with a nonconvex regularization:

$$\hat\beta = \arg\min_{\beta \in R^d}\left[\frac{1}{n}\sum_{i=1}^n (\beta^T x_i - y_i)^2 + \lambda\sum_{j=1}^d \min(\alpha, |\beta_j|)\right].$$

However, for high-dimensional problems, it is not clear whether one can
effectively find a good solution using such a nonconvex regularization condition.
When $d$ is sufficiently large, one can often find a vector $\beta$ such that $|\beta_j| > \alpha$
and it perfectly fits (thus overfits) the data. This $\beta$ is clearly a local minimum
for this nonconvex regularization condition, since the regularization has no
effect locally for such a vector $\beta$. Therefore, the two-stage $L_1$ approximation
procedure in Figure 1 not only preserves desirable properties of convex
programming, but also prevents such a local minimum from contaminating the
final solution.
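
One way to read the "two-step approximation" remark (this is our interpretation, not a statement taken from the text) is that the stage 2 penalty, plus the constant $\lambda\alpha$ for each unpenalized feature, is a convex upper bound on the capped penalty $\lambda\sum_j \min(\alpha, |\beta_j|)$ that is tight at the stage 1 solution. The sketch below checks this numerically; both function names are hypothetical.

```python
import numpy as np

def capped_l1(beta, lam, alpha):
    """Nonconvex penalty lam * sum_j min(alpha, |beta_j|)."""
    return lam * np.sum(np.minimum(alpha, np.abs(beta)))

def stage2_surrogate(beta, beta_hat, lam, alpha):
    """Stage 2 penalty built from beta_hat: features with |beta_hat_j| > alpha pay the
    constant alpha, the remaining features pay |beta_j|."""
    selected = np.abs(beta_hat) > alpha
    return lam * (alpha * np.sum(selected) + np.sum(np.abs(beta[~selected])))

rng = np.random.default_rng(1)
lam, alpha = 0.7, 1.0
beta_hat = rng.normal(scale=2.0, size=20)   # stands in for a stage 1 estimate
beta = rng.normal(scale=2.0, size=20)       # arbitrary candidate vector

print(capped_l1(beta, lam, alpha), "<=", stage2_surrogate(beta, beta_hat, lam, alpha))
print(capped_l1(beta_hat, lam, alpha), "==", stage2_surrogate(beta_hat, beta_hat, lam, alpha))
```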



9. Experiments. Although our investigation is mainly theoretical, it is
useful to verify whether the two-stage procedure can improve the standard
Lasso in practice. In the following, we show with a synthetic dataset and a real
dataset that the two-stage procedure can be helpful. Although more
comprehensive experiments are still required, these simple experiments show that
the two-stage method is useful at least on datasets with the right
properties, which is consistent with our theory. Note that, instead of tuning the $\alpha$
parameter in Figure 1, in the following experiments, we tune the parameter
$q = |\mathrm{supp}_\alpha(\hat\beta)|$, which is more convenient. The standard Lasso corresponds to
$q = 0$.

9.1. Simulation data. In this experiment, we generate an $n \times d$ random
matrix with its column $j$ corresponding to $[x_{1,j}, \ldots, x_{n,j}]$, and each element
of the matrix is an independent standard Gaussian $N(0,1)$. We then
normalize its columns so that $\sum_{i=1}^n x_{i,j}^2 = n$. A truly sparse target $\bar\beta$ is generated
with $k$ nonzero elements that are uniformly distributed from $[-10, 10]$.
Observe that $y_i = \bar\beta^T x_i + \epsilon_i$, where each $\epsilon_i \sim N(0, \sigma^2)$. In this experiment, we
take $n = 25$, $d = 100$, $k = 5$ and $\sigma = 1$, and repeat the experiment 100 times.
The average training error and parameter estimation error in 2-norm are
reported in Figure 2. We compare the performance of the two-stage method
with different $q$ versus the regularization parameter $\lambda$. Clearly, the training
error becomes smaller when $q$ increases. The smallest estimation error for
this example is achieved at $q = 3$. This shows that the two-stage procedure
with appropriately chosen $q$ performs better than the standard Lasso (which
corresponds to $q = 0$).

Fig. 2. Performance of the algorithms on simulation data. Left: average training squared
error versus $\lambda$; right: parameter estimation error versus $\lambda$.
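
The data-generating process described above can be reproduced along the following lines. This is a sketch under stated assumptions: the text does not say how the positions of the $k$ nonzero coefficients are chosen, so drawing them uniformly at random is our assumption, as are the function name `make_simulation` and the seed handling.

```python
import numpy as np

def make_simulation(n=25, d=100, k=5, sigma=1.0, seed=0):
    """Gaussian design with columns scaled so that sum_i x_ij^2 = n, a k-sparse
    target with nonzeros uniform on [-10, 10], and Gaussian noise of std sigma."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    X *= np.sqrt(n) / np.linalg.norm(X, axis=0)    # column normalization
    beta = np.zeros(d)
    idx = rng.choice(d, size=k, replace=False)     # assumed: random support
    beta[idx] = rng.uniform(-10.0, 10.0, size=k)
    y = X @ beta + sigma * rng.standard_normal(n)
    return X, y, beta

X, y, beta = make_simulation()
print(X.shape, y.shape, np.count_nonzero(beta))
```

Repeating the call with 100 different seeds and averaging the errors would mirror the 100 repetitions mentioned in the text.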

9.2. Real data. We use real data to illustrate the effectiveness of
two-stage $L_1$ regularization. For simplicity, we only report the performance on a
single dataset, Boston Housing. This is the housing data for 506 census tracts
in Boston from the 1970 census, available from the UCI machine
learning database repository (http://archive.ics.uci.edu/ml/). Each census
tract is a data-point with 13 features (we add a constant offset one as the
14th feature), and the desired output is the housing price. In the
experiment, we randomly partition the data into 20 training plus 456 test points.
We perform the experiments 100 times and report training and test squared
error versus the regularization parameter $\lambda$ for different $q$. The results are
plotted in Figure 3. In this case, $q = 1$ achieves the best performance. Note
that this dataset contains only a small number ($d = 14$) of features, which is
not the case we are interested in (most of the other UCI data similarly contain
only a small number of features). In order to illustrate the advantage of the
two-stage method more clearly, we also consider a modified Boston Housing
data, where we append 20 random features (similar to the simulation
experiments) to the original Boston Housing data, and rerun the experiments.
The results are shown in Figure 4. As we can expect, the effect of using $q > 0$
becomes far more apparent. This again verifies that the two-stage method
can be superior to the standard Lasso ($q = 0$) on some data.

Fig. 3. Performance of the algorithms on the original Boston Housing data. Left: average
training squared error versus $\lambda$; right: test squared error versus $\lambda$.

Fig. 4. Performance of the algorithms on the modified Boston Housing data. Left:
average training squared error versus $\lambda$; right: test squared error versus $\lambda$.

10. Proofs. In the proof, we use the following convention: let $I$ be a
subset of $\{1, \ldots, d\}$ and a vector $\beta \in R^d$, then $\beta_I$ denotes either the restriction
of $\beta$ to indices $I$, which lies in $R^{|I|}$, or its embedding into the original space
$R^d$ with components not in $I$ set to zero.

10.1. Proof of Proposition 3.1. Given any $v \in R^k$ and $u \in R^\ell$, without
loss of generality, we may assume that $\|v\|_2 = 1$ and $\|u\|_2 = 1$ in the following
derivation. Take indices $I$ and $J$ as in the definition. We let $I' = I \cup J$. Given
any $\alpha \in R$, let $u'^T = [v^T, \alpha u^T] \in R^{\ell+k}$. By definition, we have

$$\mu^{(2)}_{A,\ell+k}\|u'\|_2^2 \le u'^T A_{I',I'}u' = v^T A_{I,I}v + 2\alpha v^T A_{I,J}u + \alpha^2 u^T A_{J,J}u.$$

Let $b = v^T A_{I,J}u$, $c_1 = v^T A_{I,I}v$ and $c_2 = u^T A_{J,J}u$. The above inequality can
be written as

$$\mu^{(2)}_{A,\ell+k}(1 + \alpha^2) \le c_1 + 2\alpha b + \alpha^2 c_2.$$

By optimizing over $\alpha$, we obtain $(c_1 - \mu^{(2)}_{A,\ell+k})(c_2 - \mu^{(2)}_{A,\ell+k}) \ge b^2$. Therefore,
$b^2 \le (\rho^{(2)}_{A,\ell} - \mu^{(2)}_{A,\ell+k})(\rho^{(2)}_{A,k} - \mu^{(2)}_{A,\ell+k})$, which implies that (with $\|v\|_2 = \|u\|_2 =
1$)

$$\frac{v^T A_{I,J}u}{\|v\|_2\|u\|_\infty} \le \frac{\sqrt{\ell}\,|v^T A_{I,J}u|}{\|v\|_2\|u\|_2} = \sqrt{\ell}\,|b| \le \sqrt{\ell}\sqrt{(\rho^{(2)}_{A,\ell} - \mu^{(2)}_{A,\ell+k})(\rho^{(2)}_{A,k} - \mu^{(2)}_{A,\ell+k})}.$$

Since $v$ and $u$ are arbitrary, this implies the first inequality.

The second inequality follows from $\|v\|_p \le k^{\max(0,1/p-0.5)}\|v\|_2$ for all $v \in
R^k$, so that

$$\frac{\|A_{I,J}u\|_p}{\|u\|_\infty} \le k^{\max(0,1/p-0.5)}\frac{\|A_{I,J}u\|_2}{\|u\|_\infty}.$$

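From $(c_1 - \mu^{(2)}_{A,\ell+k})(c_2 - \mu^{(2)}_{A,\ell+k}) \ge b^2$, we also obtain

$$4b^2/c_1^2 \le 4c_1^{-1}(1 - \mu^{(2)}_{A,\ell+k}/c_1)(c_2 - \mu^{(2)}_{A,\ell+k}) \le (c_2 - \mu^{(2)}_{A,\ell+k})/\mu^{(2)}_{A,\ell+k} \le \rho^{(2)}_{A,\ell}/\mu^{(2)}_{A,\ell+k} - 1.$$

Note that, in the above derivation, we have used $4\mu^{(2)}_{A,\ell+k}c_1^{-1}(1 - \mu^{(2)}_{A,\ell+k}c_1^{-1}) \le
1$. Therefore, with $\|v\|_2 = \|u\|_2 = 1$,

$$\frac{v^T A_{I,J}u\,\|v\|_2}{v^T A_{I,I}v\,\|u\|_\infty} \le \frac{\sqrt{\ell}\,|v^T A_{I,J}u|}{v^T A_{I,I}v\,\|u\|_2} = \frac{\sqrt{\ell}\,|b|}{c_1} \le \frac{\ell^{1/2}}{2}\sqrt{\rho^{(2)}_{A,\ell}/\mu^{(2)}_{A,\ell+k} - 1}.$$

Because $v$ and $u$ are arbitrary, we obtain the third inequality.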

The fourth inequality follows from max(0, v T Ajjv p ~ l ) >io^\\\v\\P, so 
that 

(vP- 1 ) T A / , J u||v|| p ^ 1 |(yP" 1 ) T ^ / , J u| 1 P/,ju|| p 



max^v^vf-^HulU-^W HvH^HU " ^ ||u 



In the above derivation, the second inequality follows from
$\|(v/\|v\|_p)^{p-1}\|_{p/(p-1)} = 1$ and Hölder's inequality.

The fifth inequality follows from $\|v\|_p \le k^{\max(0,1/p-0.5)}\|v\|_2$ for all $v \in R^k$,
so that

PZ,/^,J U Hp < I^V^pax(0,l/ P -0.5) 

The sixth inequality follows from \\AJ }v|| p < ||v|L//ijf fc for all v £ fi^, so 
that 

\\ a 1} a i,M\p < i P/,ju|| p 



l|U||oo ^ IHU 

The last inequality is due to 
P/,jv|| P _ ll(v/l|v||p) p ~ 1 ||p/( P -i)II^J,Jv||p 

ll v llp ll v llp 

(v^T^jv _ (vP" 1 ) T diag(A JiJ )v 
— n hp n hp 

II v IIp II v IIp 

(vP- 1 ) r (A / , / -diag(^ J , / ))v 

n hp 
ll v llp 

> minimi - ||(v/||v||p) p \\ p /(p-i) ! ij-fi 1 

1 llvllp 

> mm Ai,i - \\Aij - diag(Ar,/)||p- 

i 

In the above derivation, Hölder's inequality is used to obtain the first two
inequalities. The first equality and the last inequality use the fact that
$\|(v/\|v\|_p)^{p-1}\|_{p/(p-1)} = 1$.

10.2. Proof of Proposition 3.2. Let $B \in R^{d \times d}$ be the off-diagonal part of
$A$; that is, $A - B$ is the identity matrix. We have $\sup_{i,j}|B_{i,j}| \le M_A$. Given
any $v \in R^k$, we have

$$\|B_{I,I}v\|_p \le M_A k^{1/p}\|v\|_1 \le M_A k\|v\|_p.$$

This implies that $\|B_{I,I}\|_p \le M_A k$. Therefore, we have

$$\|A_{I,I}v\|_p \le \|v\|_p(1 + M_A k).$$

This proves the first claim. Moreover,

$$(1 - M_A k) \le 1 - \|B_{I,I}\|_p = \min_i A_{i,i} - \|A_{I,I} - \mathrm{diag}(A_{I,I})\|_p.$$

We thus obtain the second claim from Proposition 3.1.
Now, given $u \in R^\ell$, since $I \cap J = \emptyset$, we have

$$\|A_{I,J}u\|_p = \|B_{I,J}u\|_p \le M_A k^{1/p}\|u\|_1 \le M_A k^{1/p}\ell\|u\|_\infty.$$

This implies the third claim. The last two claims follow from Proposition
3.1.
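
For $p = 2$, the first two bounds in this proof say that every $k \times k$ principal submatrix of a unit-diagonal matrix $A$ has eigenvalues in $[1 - M_A k, 1 + M_A k]$. The sketch below checks this on a random normalized design; the helper name `mutual_incoherence` and all problem sizes are illustrative assumptions.

```python
import numpy as np

def mutual_incoherence(A):
    """Largest absolute off-diagonal entry of A."""
    off_diagonal = A - np.diag(np.diag(A))
    return np.max(np.abs(off_diagonal))

rng = np.random.default_rng(0)
n, d, k = 200, 30, 4
X = rng.standard_normal((n, d))
X *= np.sqrt(n) / np.linalg.norm(X, axis=0)   # so that A has unit diagonal
A = X.T @ X / n
M = mutual_incoherence(A)

I = rng.choice(d, size=k, replace=False)      # an arbitrary index set of size k
eigs = np.linalg.eigvalsh(A[np.ix_(I, I)])
print(1 - M * k, "<=", eigs.min(), "and", eigs.max(), "<=", 1 + M * k)
```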
10.3. Some auxiliary results.

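Lemma 10.1. Consider $k, \ell > 0$ and $p \in [1, \infty]$. Given any $\beta, v \in R^d$, let
$\beta = \beta_F + \beta_G$, where $\mathrm{supp}_0(\beta_F) \cap \mathrm{supp}_0(\beta_G) = \emptyset$ and $|\mathrm{supp}_0(\beta_F)| = k$. Let
$J$ be the indices of the $\ell$ largest components of $\beta_G$ (in absolute values), and
$I = \mathrm{supp}_0(\beta_F) \cup J$. If $\mathrm{supp}_0(v) \subset I$, then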

v T Ap > vJAuPj - WAuvjWp/^^l^WPaWxr 1 , 

ma X (0, ((v/||v|| p r YAP) > w^(||v/|| p - ^^WPgWi) 

Proof. In order to prove the first inequality, we may assume, without 
loss of generality, that P = [Pi , . . . , Pd] where supp (/?i?) = {1,2,..., k}, and 
when j > k, Pj is arranged in descending order of \Pj \ : \Pk+i\ > |/9fc+2| > 
••• > \p d \. Let J = {1,.,.,/e}, and let J s = {k + (s - 1)1 + 1, . . . , k + si} 
(s = 1, 2, . . .), except the largest index in the last block stops at d. Note that, 
in this definition, we require that J± = J and I = JqVJ J\. We have \\Pj„ ||oo < 
WPjs-i lli^ -1 when s > 1, which implies that Y, s> i IIA/Joo < ll/fclli^ -1 - This 
gives 

v T A/3 = vjAijPi + vjA ItJs P Js 

8>1 

> vf - ||Aj 1 jvj|| p /^_ 1 ) ^ P7j^/ iJs /3j s ||p 

S>1 

» 

s>l 



> ^i | j/3j-7^j JU ||Aj,jVj|| J , /(p _ 1) ||A 3 ||ir 1 . 

The first inequality in the above derivation is due to Holder's inequality. 
This proves the first inequality of the lemma. 

The proof of the second inequality is similar, but with a slightly different 
estimate. We can assume that the right-hand side is positive (the inequality 
is trivial otherwise). It implies that (v I ~ 1 ) T Ai jvj > 0, since, otherwise, 

(vP-YAp = {v^f AiAPi - vz) + (vt'YAjjvj + ^(vj-fiy^i 

S>1 



>(v?- 1) ) T A 7 ,/(/3/-v / ) 



+ (vt 1) fA IJ v I 



i-^m.iEll^lU/llv/llp 

s>l 
^ -pS/|II v ?" 1 X/(p-i)I^ 7_Vj IIp
>-p%\\\^\/^)Wi-Mv 

+ ^j / |||v 7 ||^l- 7 r5 , j /| /- 1 ||/3 G || 1 /||v J || p ]. 

The second inequality in the above derivation is due to Holder's inequality. 
The last inequality assumes that the right-hand side is nonnegative. Observe 
that ||vj p = ||v/||^ _1 ; thus, we obtain the second inequality of the 

lemma. □ 



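Lemma 10.2. Consider the decomposition of any target vector $\bar\beta = \bar\beta_F +
\bar\beta_G$ such that $\{1, 2, \ldots, d\} = F \cup G$ and $F \cap G = \emptyset$. Consider the solution $\hat\beta$
to the following, more general, problem instead of (1):

(5) $\hat\beta = \arg\min_{\beta \in R^d}\left[\frac{1}{n}\sum_{i=1}^n (\beta^T x_i - y_i)^2 + \lambda\sum_{j \notin \hat F}|\beta_j|\right],$

where $\hat F \subset F$. Let $\Delta\hat\beta = \hat\beta - \bar\beta$, $\hat A = \frac{1}{n}\sum_{i=1}^n x_i x_i^T$ and $\hat\epsilon = \frac{1}{n}\sum_{i=1}^n (\bar\beta^T x_i -
y_i)x_i$. If we pick a sufficiently large $\lambda$ in (5) such that $\lambda > 2\|\hat\epsilon\|_\infty$, then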



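$$\|\hat\beta_G\|_1 \le \frac{2\|\hat\epsilon\|_\infty + \lambda}{\lambda - 2\|\hat\epsilon\|_\infty}\,(\|\Delta\hat\beta_F\|_1 + \|\bar\beta_G\|_1).$$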



Proof. We define the derivative of \\(3\\i as sgn(/3), where, for j3 = 
\Pi,...,0 d ]€ R d , sgn(/3) = [sgn(/?i), . . . , sgn(/3 d )] G i? d is defined as sgn(/3 J ) = 
1 when f3j > 0, sgn(/3j) = —1 when (3j < and sgn(/3j) G [—1, 1] when (3j = 0. 
We start with the first-order condition 

2 n 

-^(/3 T x i -y i )x i + A 5 (/3) = 0, 

where g0) = [g0i), . . . ,g0 d )], with g((3j) = when j € F and = 
sgn((3j), otherwise. This implies that 

r, n 

2iA/3 + A<?(/3) = — £(/3 T Xi - yi )xi. 
Therefore, for all v G we have 



(6) 



2v T iA/3 < -2v T e - \v T g{{3). 
Now, let $v = \Delta\hat\beta$ in (6), and use the fact that $\hat\beta_G^T g(\hat\beta_G) = \hat\beta_G^T\,\mathrm{sgn}(\hat\beta_G) = \|\hat\beta_G\|_1$



< 2A/3 T iA/3 < 2|A/3 T e| - AA/3 T 5 (/3) 

< 2||A/3|| 1 p|| 0O - \A0 F T g0) - A/3§ 5 (/3) + A/3g 5 (/3) 

< 2(||A/3 F || 1 + H/SgIIi + Pg||i)P||oc + A||A/3/H|i - \\\$g\\i + \\\Pg\\i 
= (2||e|| 00 - A)||/3 G ||i + (2||e|| 00 + A)(||A/3 F ||i + \\0 G \\i). 



By rearranging the above inequality, we obtain the desired bound. □ 

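Lemma 10.3. Let the conditions of Lemma 10.2 hold. Let $J$ be the
indices of the largest $\ell$ coefficients (in absolute value) of $\Delta\hat\beta_G$ and $I = F \cup J$.
If $\lambda \ge 4(2 - t)t^{-1}\|\hat\epsilon\|_\infty$ for some $t \in (0,1)$, then $\forall p \in [1, \infty]$,

$$\|\Delta\hat\beta\|_1 \le 4k^{1-1/p}\|\Delta\hat\beta_I\|_p + 4\|\bar\beta_G\|_1,$$

$$\|\Delta\hat\beta\|_p \le (1 + 3(k/\ell)^{1-1/p})\|\Delta\hat\beta_I\|_p + 4\|\bar\beta_G\|_1\ell^{1/p-1}.$$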

Proof. The condition on A implies that (A + 2||e|| 00 )/(A — 2||e|| 00 ) < 
(4 - t)/(4 - 3i) < 3. We have, from Lemma 10.2, 

||AA?||i < Pg||i + < 3||A/3 F ||i + 4p G ||i. 

Therefore, ||A/3 - A/3/|U < ||A/3 G ||i/^ < (3||A/3 F ||i + 4||/3 G ||i)/£, which im- 



||A/3 - APj\\p < (IIA/3GIKHA/3 - A/3/H^ 1 ) 1 ^ < (3||A/3/H|i + 4||A?|| 1 )£ 1 /p-i. 



we obtain the desired bounds. □ 

Lemma 10.4. Let the conditions of Lemma 10.2 hold. Let J be the in- 
dices of the largest i coefficients (in absolute value) of A/3 G and I = F U J. 



as well as ||ff(/3)||oo 

< 1. We obtain 



plies that 



Now, the first inequality in the proof also implies that 





1/p A + ik> y 




where Fq is any subset of F. 
Proof. The condition on $\lambda$ implies that $(\lambda + 2\|\hat\epsilon\|_\infty)/(\lambda - 2\|\hat\epsilon\|_\infty) \le$
(4 - t)/{A - 3t). Therefore, if we let A/3 = A/3 7 = A/3/ + /3j, then 

max(0,((A/3 / /||A/3 / || p r 1 ) T iA/3) + ^p J || p 



^^/IIA/j^p-^Z-lA^iu; 
>"%J\Wi\\* 



> J^jwaPjW, - (i - 1)(4 - 1)(4 - st)- 1 ]^/^ 

In the above derivation, the first inequality is due to Lemma 10.1, and the 
second inequality is due to Lemma 10.2. The third inequality uses ||A/3jr||i = 
||A/3 F ||i < fc 1_1 / p ||A/3j|| p and (4-f)/(4-3i) < 3. The last inequality follows 
from 1 - (1 - t)(4 - t)(4 - St)" 1 > 0.5*. 

If (A/3 I ~ 1 ) T AA/3 < 0, then the above inequality, together with ||A/3j|| p < 
||A/3j|| p + ||/3j||p < ||A/3/||p + ||j9g||p, already implies the lemma. Therefore in 
the following, we can assume that 



-^Mi«/-'ii*iu-<' t+( ii&ii P . 

Moreover, we obtain, from (6) with v = A/?]? -1 , the following: 
<|(A/3r 1 ) T e|-A(A/3r 1 ) T ^)/2 

a aP~ 



< |(^ _1 ) T e| - \0y l ) T g{(3)/2 + \{A^ F fi\ 



- \{A^ F -} F fgCP)/2 + \(Aj3 F ~ 1 ) T e\ 

< (Halloo - A/2)||/3^ _1 ||i + (||f |U + A/2)||A^V lli + KA/^fel 

< |ir |)Vp A || Ayf^TVo Hp/Cp-i) + W%\/^-i)¥fA p 

^((^-iFoD^A+ii^yiiA^n^ 1 . 
In the above derivation, the second inequality uses $g(\hat\beta_{\hat F}) = 0$; the third
inequality uses the fact that ||g(/3)||oo < 1 and (/3j~ ) T g{(3) = \\(3j~ ||i; the 
fourth inequality uses Halloo — O.5A5 and the last inequality uses the fact 
that = ||/3||p _1 - Now, by combining the above two estimates, 

together with ||A/3/|| p < ||A/3/|| p + \\/3j\\ p < ||A/3/|| p + \\(3 G \\ P , we obtain the 
desired bound. □ 

Lemma 10.5. Let the conditions of Lemma 10.2 hold. Let J be the in- 
dices of the largest I coefficients (in absolute value) of A(3c and I = F U J. 

Assume thatA ItI is invertible. LetpG [l,oo]. Ift=l-^? ) k+ee k 1 " 1 / p £~ 1 >0 
and A>4(2-t)t~ 1 ||e|| 00 . Then, 

Proof. Consider v G R d such that supp (v) C /. We have, from Lemma 
10.1 and (6), 

v T AAfr - \\M,iM v /{ P -i)l% + ^\ A/Mi* -1 < v T iA/3 

<-v T (e + 0.5A 5 (/3)). 
Take v such that HA^/v/Hp/^!) = 1 and v T AA(3j = \\A/3i\\ p . We obtain 

II A PiWp ~ ^i+^ll/Mi + ll^HOr 1 < \\Aj}(ij + O.5Xg0j))\\ P - 

By using Lemma 10.2, (A + 2||e|| 00 )/(A - 2p||oo) < (4 - t)/(4 - 3t) < 3 and 
1 - (1 - t)(4- t)/(4 - 3t) > 0M; thus, we obtain 

ll^ll^-^U/" 1 "!^"! 1 

> ||A&|| P - 7i P) fc+ ,/" 1(4 " t)(4 " 3*) -1 (I|AjMi + IIH) 

> || A/3^ - 7i P) fc+ ,/ _1 ( 4 ~ W 4 " atr^A^-VPHA^Hp + p G ||i) 

> HA/3/llp - (1 - t)(4 - t)(4 - S^^IIA^Hp - St^/^PgIIi 

^O.StllA^llp-ST^/- 1 !!^!)!. 
Combine the previous two inequalities. We obtain 
0M\\Ak\\ P < 47i P) w II^G||ir 1 + \\Aj}(ij + O.5\g0i))\\ p 

^ 4 ^ii + «II^H irl + ( fc + ^) 1/P H^ + ^ X 90l)\Un% + ,. 
Since + . 5 Ac/ ( ) 1 1 oo < A, we obtain the desired bound. □ 
Proposition 10.1. Consider $n$ independent random variables $\xi_1, \ldots, \xi_n$
such that $Ee^{t(\xi_i - E\xi_i)} \le e^{\sigma_i^2 t^2/2}$ for all $t$ and $i$, then $\forall \epsilon > 0$:

$$P\left(\left|\sum_{i=1}^n \xi_i - \sum_{i=1}^n E\xi_i\right| \ge n\epsilon\right) \le 2e^{-n^2\epsilon^2/(2\sum_{i=1}^n \sigma_i^2)}.$$

Proof. Let $s_n = \sum_{i=1}^n (\xi_i - E\xi_i)$; then, by assumption, $E(e^{ts_n} + e^{-ts_n}) \le
2e^{\sum_i \sigma_i^2 t^2/2}$, which implies that $P(|s_n| \ge n\epsilon)e^{tn\epsilon} \le 2e^{\sum_i \sigma_i^2 t^2/2}$. Now, let $t =
n\epsilon/\sum_i \sigma_i^2$; thus, we obtain the desired bound. □
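
As a sanity check of the tail bound, consider the Gaussian case $\xi_i \sim N(0, \sigma_i^2)$, which satisfies the moment condition with equality. The sum is then $N(0, \sum_i \sigma_i^2)$ and its two-sided tail is available in closed form through the complementary error function, so the bound can be compared with the exact probability. The function names and the numbers below are illustrative assumptions.

```python
import math

def subgaussian_tail_bound(n, eps, sigmas):
    """Right-hand side 2 * exp(-n^2 eps^2 / (2 * sum_i sigma_i^2))."""
    total_var = sum(s * s for s in sigmas)
    return 2.0 * math.exp(-(n * eps) ** 2 / (2.0 * total_var))

def gaussian_exact_tail(n, eps, sigmas):
    """Exact P(|sum_i xi_i| >= n * eps) when xi_i ~ N(0, sigma_i^2) are independent."""
    total_sd = math.sqrt(sum(s * s for s in sigmas))
    return math.erfc(n * eps / (total_sd * math.sqrt(2.0)))

sigmas = [1.0, 0.5, 2.0, 1.5]
n = len(sigmas)
for eps in (0.1, 0.5, 1.0, 2.0):
    print(eps, gaussian_exact_tail(n, eps, sigmas), subgaussian_tail_bound(n, eps, sigmas))
```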

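Proposition 10.2. Consider $n$ independent random variables $\xi_1, \ldots, \xi_n$,
such that $E\xi_i = 0$ and $Ee^{t\xi_i} \le e^{\sigma^2 t^2/2}$ for all $t$ and $i$. Let $z_1, \ldots, z_n \in R^d$ be
$n$ fixed vectors, and let $a_n = (\sum_{i=1}^n \|z_i\|_2^2)^{1/2}$. Then, $\forall \epsilon > 0$:

$$P\left(\left\|\sum_{i=1}^n \xi_i z_i\right\|_2 \ge a_n(\sigma + \epsilon)\right) \le e^{-\epsilon^2/(20\sigma^2)}.$$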



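Proof. For each $i$, let $\xi'_i$ be an identically distributed and independent
copy of $\xi_i$ and $h(\cdot)$ be any real-valued function such that $|h(\xi_i) - h(\xi'_i)| \le |\xi_i| + |\xi'_i|$. Then,

$$
\begin{aligned}
\mathbf{E}_{\xi_i} e^{t(h(\xi_i) - \mathbf{E}_{\xi'_i} h(\xi'_i))}
&= 1 + \sum_{k=2}^{\infty} \frac{t^k}{k!} \mathbf{E}_{\xi_i}\bigl(h(\xi_i) - \mathbf{E}_{\xi'_i} h(\xi'_i)\bigr)^k \\
&\le 1 + \sum_{k=2}^{\infty} \frac{|t|^k}{k!} \mathbf{E}(|\xi_i| + |\xi'_i|)^k
\le 1 + \sum_{k=2}^{\infty} \frac{|2t|^k}{k!} \mathbf{E}|\xi_i|^k \\
&= 1 + \sum_{k=1}^{\infty} \Bigl[\frac{1}{(2k)!} \mathbf{E}|2t\xi_i|^{2k} + \frac{1}{(2k+1)!} \mathbf{E}|2t\xi_i|^{2k+1}\Bigr] \\
&\le 1 + \sum_{k=1}^{\infty} \Bigl[\frac{1 + 0.5}{(2k)!} \mathbf{E}|2t\xi_i|^{2k} + \frac{1}{(2k+2)!} \mathbf{E}|2t\xi_i|^{2k+2}\Bigr] \\
&\le 1 + 2.5 \sum_{k=1}^{\infty} \frac{1}{(2k)!} \mathbf{E}|2t\xi_i|^{2k}
= 1 + 1.25(\mathbf{E}e^{2t\xi_i} + \mathbf{E}e^{-2t\xi_i} - 2) \\
&\le 1 + 1.25(2e^{2\sigma^2 t^2} - 2) \le e^{5\sigma^2 t^2}.
\end{aligned}
$$

The second inequality is due to Jensen's inequality. In the third inequality, we have used $|a|^{2k+1}/(2k+1)! \le 0.5|a|^{2k}/(2k)! + |a|^{2k+2}/(2k+2)!$. The
last inequality can be obtained by comparing the Taylor expansion of the
function $e^x$ on both sides.

Now, let $s_j = \mathbf{E}_{\xi_{j+1},\ldots,\xi_n} \|\sum_{i=1}^n \xi_i z_i\|_2$. If we regard $h(\xi_j) = s_j/\|z_j\|_2$ as a
function of $\xi_j$ (with variables $\xi_1,\ldots,\xi_{j-1}$ fixed), then $s_j - s_{j-1} = (h(\xi_j) - \mathbf{E}_{\xi_j} h(\xi_j))\|z_j\|_2$ and $|h(\xi_j) - h(\xi'_j)| \le |\xi_j| + |\xi'_j|$. Therefore, from the above
inequality, we have $\mathbf{E}_{\xi_j} e^{t(s_j - s_{j-1})} \le e^{5\|z_j\|_2^2 \sigma^2 t^2}$ and

$$
\mathbf{E}_{\xi_1,\ldots,\xi_j} e^{t s_j} = \mathbf{E}_{\xi_1,\ldots,\xi_{j-1}} \bigl[e^{t s_{j-1}} \mathbf{E}_{\xi_j} e^{t(s_j - s_{j-1})}\bigr] \le e^{5\|z_j\|_2^2 \sigma^2 t^2}\, \mathbf{E}_{\xi_1,\ldots,\xi_{j-1}} e^{t s_{j-1}}.
$$

By induction, we obtain $\mathbf{E}_{\xi_1,\ldots,\xi_n} e^{t s_n} \le e^{5\sigma^2 t^2 a_n^2} e^{t s_0}$, which implies that $P(s_n \ge s_0 + a_n\epsilon)\, e^{t(s_0 + a_n\epsilon)} \le e^{5\sigma^2 t^2 a_n^2} e^{t s_0}$. Let $t = \epsilon/(10 a_n \sigma^2)$, we have $P(s_n \ge s_0 + a_n\epsilon) \le e^{-\epsilon^2/(20\sigma^2)}$.

Note that

$$
\mathbf{E}\xi_i^2 = \lim_{t \to 0} 2t^{-2}\, \mathbf{E}(e^{t\xi_i} - 1) \le \lim_{t \to 0} 2t^{-2}(e^{\sigma^2 t^2/2} - 1) = \sigma^2.
$$

Therefore, $s_0 = \mathbf{E}\|\sum_{i=1}^n \xi_i z_i\|_2 \le (\sum_{i=1}^n \mathbf{E}\xi_i^2 \|z_i\|_2^2)^{1/2} \le a_n \sigma$. This leads to
the desired bound. □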



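10.4. Proof of Theorem 4.1. Let $F$ be the indices corresponding to the
largest $k$ coefficients of $\bar\beta$ in absolute value. We only need to estimate $\|\hat\epsilon\|_\infty$
and then apply Lemmas 10.4, 10.5 and 10.3, with $F_0 = \emptyset$. By Proposition
10.1, we have

$$
\begin{aligned}
&P\Biggl(\sup_j \sum_{i=1}^n x_{i,j}^2 \le na^2 \ \text{and}\ \sup_j \Biggl|\frac{1}{n}\sum_{i=1}^n (y_i - \mathbf{E}y_i) x_{i,j}\Biggr| \ge \epsilon\Biggr) \\
&\qquad \le d \sup_j P\Biggl(\Biggl|\frac{1}{n}\sum_{i=1}^n (y_i - \mathbf{E}y_i) x_{i,j}\Biggr| \ge \epsilon\Biggr) \quad \text{subject to } \sup_j \sum_{i=1}^n x_{i,j}^2 \le na^2 \\
&\qquad \le 2d \sup_j e^{-n^2\epsilon^2/(2\sigma^2 \sum_{i=1}^n x_{i,j}^2)} \quad \text{subject to } \sup_j \sum_{i=1}^n x_{i,j}^2 \le na^2 \\
&\qquad \le 2d\, e^{-n\epsilon^2/(2\sigma^2 a^2)}.
\end{aligned}
$$

Therefore, with probability larger than $1 - \delta$, if $\sup_j \sum_{i=1}^n x_{i,j}^2 \le na^2$, then

$$
\sup_j \Biggl|\frac{1}{n}\sum_{i=1}^n (y_i - \mathbf{E}y_i) x_{i,j}\Biggr| \le \sigma a \sqrt{\frac{2\ln(2d/\delta)}{n}}.
$$

The latter implies that

$$
\|\hat\epsilon\|_\infty = \sup_j \Biggl|\frac{1}{n}\sum_{i=1}^n (\bar\beta^T x_i - y_i) x_{i,j}\Biggr| \le \sigma a \sqrt{\frac{2\ln(2d/\delta)}{n}} + \Biggl\|\frac{1}{n}\sum_{i=1}^n (\bar\beta^T x_i - \mathbf{E}y_i) x_i\Biggr\|_\infty.
$$

With this bound, the condition of Lemma 10.3 is satisfied. Using $k/\ell \le 1$,
we obtain, for $q = 1, p$,

$$
\|\Delta\hat\beta\|_q \le 4k^{1/q - 1/p}\|\Delta\hat\beta_I\|_p + 4\|\bar\beta_G\|_1 \ell^{1/q - 1}.
$$

This estimate, together with Lemma 10.4 (let $F_0 = \emptyset$), leads to the first
claim of the theorem; with Lemma 10.5, it gives the second claim.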
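The noise estimate in this proof translates directly into a lower bound on the regularization parameter, $\lambda \ge 4(2-t)t^{-1}\|\hat\epsilon\|_\infty$. The sketch below evaluates that bound; the function names and the optional `approx_err` argument (standing for the term $\|\frac{1}{n}\sum_i(\bar\beta^T x_i - \mathbf{E}y_i)x_i\|_\infty$) are illustrative assumptions, and the numbers in the usage lines are arbitrary.

```python
import math

def noise_level(sigma, a, d, n, delta):
    """High-probability bound sigma * a * sqrt(2 ln(2d/delta) / n) on the stochastic part of ||eps_hat||_inf."""
    return sigma * a * math.sqrt(2.0 * math.log(2.0 * d / delta) / n)

def min_lambda(sigma, a, d, n, delta, t, approx_err=0.0):
    """Smallest lambda allowed by the condition lambda >= 4(2 - t)/t * (noise level + approximation error)."""
    if not 0.0 < t <= 1.0:
        raise ValueError("t must lie in (0, 1]")
    return 4.0 * (2.0 - t) / t * (noise_level(sigma, a, d, n, delta) + approx_err)

# Illustrative (assumed) values: unit noise, normalized columns, d = 100 features, n = 1000 samples.
eps_bound = noise_level(sigma=1.0, a=1.0, d=100, n=1000, delta=0.05)
lam = min_lambda(sigma=1.0, a=1.0, d=100, n=1000, delta=0.05, t=0.5)
print(f"noise bound {eps_bound:.4f}, minimal lambda {lam:.4f}")
```

With $t = 0.5$ the multiplier $4(2-t)/t$ equals 12, so the admissible $\lambda$ is an order of magnitude larger than the noise level itself.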

10.5. Proof of Corollary 4.1. The condition $M_A(k+\ell) \le (1-t)/(2-t)$ is
equivalent to $t \le 1 - M_A(k+\ell)/(1 - M_A(k+\ell))$. It implies $t \le 1 - M_A(k+\ell)^{1/p} k^{1-1/p}/(1 - M_A(k+\ell))$. Now, using Proposition 3.2, it implies that
the condition $t \le 1 - \pi^{(p)}_{A,k+\ell,\ell} k^{1-1/p}/\ell$ in the first claim of Theorem 4.1 is
satisfied. Therefore, we have (with $q = p$)

Now, using Proposition 3.2 again, the inequality can be simplified to

Now, by using the condition $M_A(k+\ell) \le (1-t)/(2-t) \le 0.5$ to eliminate
and simplify the result, we obtain the desired bound.

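10.6. Proof of Proposition 5.1. We construct a sequence $\beta^{(k)}$ with a
greedy algorithm as follows. Let $\beta^{(0)} = \bar\beta$, and, for $k = 1, 2, \ldots$, we perform
the following steps:

• $j^{(k)} = \arg\max_j \bigl|\sum_{i=1}^n (\beta^{(k-1)T} x_i - \mathbf{E}y_i) x_{i,j}\bigr|$;

• $\beta^{(k)} = \beta^{(k-1)} + \alpha^{(k)} e_{j^{(k)}}$, where $e_j \in R^d$ is the vector of zeros, except for
the $j$th component being one.

The following derivation holds for the above procedure:

$$
\begin{aligned}
\sum_{i=1}^n (\beta^{(k)T} x_i - \mathbf{E}y_i)^2
&= \sum_{i=1}^n (\beta^{(k-1)T} x_i - \mathbf{E}y_i)^2 + \alpha^{(k)2} \sum_{i=1}^n x_{i,j^{(k)}}^2
 + 2\alpha^{(k)} \sum_{i=1}^n (\beta^{(k-1)T} x_i - \mathbf{E}y_i) x_{i,j^{(k)}} \\
&\le \sum_{i=1}^n (\beta^{(k-1)T} x_i - \mathbf{E}y_i)^2 - \Biggl(\sum_{i=1}^n (\beta^{(k-1)T} x_i - \mathbf{E}y_i) x_{i,j^{(k)}}\Biggr)^2 \Big/ (na^2) \\
&= \sum_{i=1}^n (\beta^{(k-1)T} x_i - \mathbf{E}y_i)^2 - \Biggl\|\sum_{i=1}^n (\beta^{(k-1)T} x_i - \mathbf{E}y_i) x_i\Biggr\|_\infty^2 \Big/ (na^2).
\end{aligned}
$$

In the above derivation, first equality is simple algebra; the inequality uses
the definition of $a$ and $\alpha^{(k)}$; and the last equality uses the definition of $j^{(k)}$.
Since $\sum_{i=1}^n (\beta^{(k+1)T} x_i - \mathbf{E}y_i)^2 \ge 0$, we obtain

$$
\sum_{k'=0}^{k} \Biggl\|\sum_{i=1}^n (\beta^{(k')T} x_i - \mathbf{E}y_i) x_i\Biggr\|_\infty^2 \le na^2 \sum_{i=1}^n (\bar\beta^T x_i - \mathbf{E}y_i)^2.
$$

Therefore, there exists $k' \le k$ such that the displayed equation of the proposition holds with $\bar\beta^{(k)} = \beta^{(k')}$. Moreover,

$$
\begin{aligned}
\sqrt{\mu^{(2)}_{A,k}}\, \|\bar\beta^{(k)} - \bar\beta\|_2 &\le \|\hat A^{1/2}(\bar\beta^{(k)} - \bar\beta)\|_2 \\
&\le \Biggl(\frac{1}{n}\sum_{i=1}^n (\bar\beta^{(k)T} x_i - \mathbf{E}y_i)^2\Biggr)^{1/2} + \Biggl(\frac{1}{n}\sum_{i=1}^n (\bar\beta^T x_i - \mathbf{E}y_i)^2\Biggr)^{1/2} \\
&\le 2\Biggl(\frac{1}{n}\sum_{i=1}^n (\bar\beta^T x_i - \mathbf{E}y_i)^2\Biggr)^{1/2}.
\end{aligned}
$$

The proof is complete.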
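The greedy construction used in this proof can be sketched in a few lines. This is an illustration under stated assumptions rather than the paper's exact procedure: the step size $\alpha^{(k)} = -\sum_i(\beta^{(k-1)T}x_i - \mathbf{E}y_i)x_{i,j^{(k)}}/(na^2)$ is an assumed choice that makes the displayed inequality hold, and the names `greedy_refine`, `best_iterate` and the synthetic data are hypothetical.

```python
import numpy as np

def greedy_refine(X, Ey, beta_bar, k):
    """Run k greedy coordinate steps starting from beta_bar; return all iterates."""
    n, _ = X.shape
    a2 = np.max(np.sum(X ** 2, axis=0)) / n        # a^2 = sup_j (1/n) sum_i x_ij^2
    beta = np.array(beta_bar, dtype=float)
    iterates = [beta.copy()]
    for _ in range(k):
        corr = X.T @ (X @ beta - Ey)               # sum_i (beta^T x_i - E y_i) x_i
        j = int(np.argmax(np.abs(corr)))           # coordinate with the largest correlation
        alpha = -corr[j] / (n * a2)                # ASSUMED step size (see lead-in)
        beta[j] += alpha
        iterates.append(beta.copy())
    return iterates

def best_iterate(X, Ey, iterates):
    """Pick the iterate with the smallest sup-norm correlation, as in the existence argument."""
    scores = [np.max(np.abs(X.T @ (X @ b - Ey))) for b in iterates]
    return iterates[int(np.argmin(scores))]

# Synthetic (assumed) example: the mean E y is not exactly linear in x.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
beta_bar = np.zeros(8)
beta_bar[0] = 1.0
Ey = X @ beta_bar + 0.5 * np.sin(X[:, 1])
iterates = greedy_refine(X, Ey, beta_bar, k=3)
rss = [float(np.sum((X @ b - Ey) ** 2)) for b in iterates]
refined = best_iterate(X, Ey, iterates)
```

Each step changes a single coordinate, so after $k$ steps the iterate differs from $\bar\beta$ in at most $k$ positions, and the residual sum of squares never increases.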



10.7. Proof of Corollary 5.1. The proof is just a straightforward application of Corollary 4.1, in which we replace $\bar\beta$ by $\bar\beta^{(k)}$ of Proposition 5.1 and
then replace $k$ by $2k$. This leads to the bound

$$\|\hat\beta - \bar\beta\|_2 \le \|\hat\beta - \bar\beta^{(k)}\|_2 + \|\bar\beta - \bar\beta^{(k)}\|_2$$

< ?^[i.5rgG9M) + (2fc)^ A ] + 4rSCSW) 

Note that, from Proposition 3.2, we have $1/\sqrt{\mu^{(2)}_{A,k}} \le (1 - M_A k)^{-1/2} \le 2$.

Moreover, since supp (/#) - (3) <k, we have r^(p^) < r£°(,9). This leads 
to the desired bound.

10.8. Proof of Corollary 6.1. Note that Proposition 3.1 implies that

■*T, 1 <0 { f 1 „Jujf, ,<pf, „\/^M 2) , .■ A1 so note that uA 2 > = fif.. f 

A,k+i,£ — A,k+£,e' A,k+e — r A,k+l 1 A,k+£ A,k+£ ^A,k+t 

The first statement of Theorem 4.1 (with q = p = 2) implies the desired 
bound. 



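10.9. Proof of Theorem 7.1. Under the conditions of this theorem, we
obtain, from Theorem 4.1, that, with probability $1 - \delta$, the following two
claims hold (with $q = p$):

• If $t \le 1 - \pi^{(p)}_{A,k+\ell,\ell} k^{1-1/p}/\ell$, $\lambda \ge 4(2-t)t^{-1}(\sigma a\sqrt{2\ln(2d/\delta)/n})$, then $\|\Delta\hat\beta\|_p \le 8k^{1/p}\lambda/(t\mu^{(p)}_{A,k+\ell})$.

• If $t \le 1 - \gamma^{(p)}_{A,k+\ell,\ell} k^{1-1/p}/\ell$, $\lambda \ge 4(2-t)t^{-1}(\sigma a\sqrt{2\ln(2d/\delta)/n})$, then $\|\Delta\hat\beta\|_p \le$

That is, if there exist $\ell \ge k$, $t \in (0,1)$, and $p \in [1,\infty]$ so that:

• $\lambda \ge 4(2-t)t^{-1}(\sigma a\sqrt{2\ln(2d/\delta)/n})$;

• either $t \le 1 - \pi^{(p)}_{A,k+\ell,\ell} k^{1-1/p}/\ell$, $\alpha \ge 8\epsilon^{-1}(t\mu^{(p)}_{A,k+\ell})^{-1} k^{1/p}\lambda$; or

then $\|\Delta\hat\beta\|_\infty \le \|\Delta\hat\beta\|_p \le \epsilon\alpha$. Note that the condition $\|\Delta\hat\beta\|_\infty \le \epsilon\alpha$ implies
that $\mathrm{supp}_{(1+\epsilon)\alpha}(\bar\beta) \subset \mathrm{supp}_\alpha(\hat\beta) \subset \mathrm{supp}_{(1-\epsilon)\alpha}(\bar\beta)$. This proves the desired result.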
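The final implication, from a sup-norm error bound to the nested thresholded supports, is easy to check on a toy example. The sketch below assumes the convention $\mathrm{supp}_\alpha(\beta) = \{j : |\beta_j| > \alpha\}$; the helper name `supp` and the specific vectors are illustrative.

```python
import numpy as np

def supp(beta, alpha):
    """Thresholded support {j : |beta_j| > alpha} (assumed convention)."""
    return set(np.flatnonzero(np.abs(np.asarray(beta, dtype=float)) > alpha).tolist())

# Toy (assumed) target and estimate with ||beta_hat - beta_bar||_inf <= eps * alpha.
alpha, eps = 0.4, 0.5
beta_bar = np.array([2.0, 0.0, -1.0, 0.05, 0.5])
beta_hat = beta_bar + np.array([0.15, -0.2, 0.1, 0.2, -0.2])   # each entry at most eps * alpha = 0.2

lower = supp(beta_bar, (1 + eps) * alpha)    # features that must be selected
selected = supp(beta_hat, alpha)             # features selected by thresholding the estimate
upper = supp(beta_bar, (1 - eps) * alpha)    # features that may be selected
print(lower, selected, upper)
```

Every feature whose target coefficient exceeds $(1+\epsilon)\alpha$ is selected, and no feature whose target coefficient is at most $(1-\epsilon)\alpha$ can be selected; coefficients in between (such as the last entry above) may go either way.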



10.10. Proof of Corollary 7.1. We take $p = \infty$, $\ell = k$ and $t = 0.5$ in Theorem 7.1. Proposition 3.2 implies that $\mu^{(\infty)}_{A,k+\ell} \ge 1 - M_A(k+\ell) \ge 0.5$ and
$\pi^{(\infty)}_{A,k+\ell,\ell} \le M_A\ell/(1 - M_A(k+\ell)) \le 0.5$ under the conditions of the corollary.
Therefore, the condition $8(\epsilon\alpha\mu^{(\infty)}_{A,k+\ell})^{-1}\lambda \le t \le 1 - \pi^{(\infty)}_{A,k+\ell,\ell}\, k/\ell$ in Theorem
7.1 holds.



10.11. Proof of Corollary 7.2. We want to apply Theorem 7.1 with $\epsilon = 0.5$, $\hat A = \hat A_n$, $\delta = \delta_n = \exp(-n^{s'})$, $t = 0.5$, $\alpha = \alpha_n = n^{-s/2}$, $d = d_n$, $k = k_n$, and
$\ell = q_n k_n$. Then, $\lambda = \lambda_n = 4(2-t)t^{-1}(\sigma a\sqrt{2\ln(2d_n/\delta_n)/n})$. The condition

'2.(1+2*0*- " (1 + ^2,(1+2^ impUes that £/k ~ p A,k + J»A,k+2i - 
1 > 4(7r? ) 2 £ _1 , where the second inequality is due to Proposition 3.1. 

Therefore, the condition $t \le 1 - \pi^{(2)}_{A,k+\ell,\ell}\, k^{0.5}/\ell$ is satisfied.

Now, from the conditions $k_n = o(n^{1-s-s'})$ and $k_n \ln(d_n) = o(n^{1-s})$, we
have $\lim_{n\to\infty}(\sqrt{k_n}\,\lambda_n/\alpha_n) = 0$. Since $(\mu^{(2)}_{A,k+\ell})^{-1} = O(1)$, the condition $\alpha \ge 16(t\mu^{(2)}_{A,k+\ell})^{-1}\sqrt{k}\,\lambda$ is also satisfied when $n$ is sufficiently large.

Therefore, by Theorem 7.1, when $n$ is sufficiently large, $\mathrm{supp}_{1.5\alpha_n}(\bar\beta_n) \subset \mathrm{supp}_{\alpha_n}(\hat\beta_n) \subset \mathrm{supp}_{0.5\alpha_n}(\bar\beta_n)$ with probability at least $1 - \delta_n$. Since $1/\min_{j \in \mathrm{supp}_0(\bar\beta_n)} |\bar\beta_{n,j}| = o(n^{s/2})$, we know when $n$ is sufficiently large,
$\min_{j \in \mathrm{supp}_0(\bar\beta_n)} |\bar\beta_{n,j}| > 2\alpha_n$. Thus, $\mathrm{supp}_{1.5\alpha_n}(\bar\beta_n) = \mathrm{supp}_{0.5\alpha_n}(\bar\beta_n) = \mathrm{supp}_0(\bar\beta_n)$.
This means that $\mathrm{supp}_0(\bar\beta_n) = \mathrm{supp}_{\alpha_n}(\hat\beta_n)$ with probability at least $1 - \delta_n$
when $n$ is sufficiently large.

10.12. Proof of Theorem 8.1. Let F = supp A (/3). We would like to apply 
Lemmas 10.2, 10.4 and 10.3, with F = supp Q (/?) and Fq = supp x 5q ,(/3). 

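First, Theorem 7.1 implies that, with probability larger than $1 - \delta$,
$\mathrm{supp}_{1.5\alpha}(\bar\beta) \subset \mathrm{supp}_\alpha(\hat\beta) \subset \mathrm{supp}_{0.5\alpha}(\bar\beta) \subset F$. Moreover,

(7) $\|\hat\epsilon\|_\infty \le a\sigma\sqrt{\dfrac{2\ln(2d/\delta)}{n}}$.

This can be seen from the proof of Theorem 4.1 (which directly implies
Theorem 7.1).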

Let $z_i = [z_{i,1},\ldots,z_{i,d}] \in R^d$, so that $z_{i,j} = x_{i,j}/n$ if $j \in F_0$ and $z_{i,j} = 0$, otherwise. We thus have $\|\hat\epsilon_{F_0}\|_2 = \|\sum_{i=1}^n (y_i - \mathbf{E}y_i) z_i\|_2$. Since each $y_i - \mathbf{E}y_i$
is an independent sub-Gaussian random variable, and $\sum_{i=1}^n \|z_i\|_2^2 = \sum_{j \in F_0}\sum_{i=1}^n (x_{i,j}/n)^2 \le qa^2/n$, we obtain, from Proposition 10.2, that, with
probability larger than $1 - \delta$,

(8) $\|\hat\epsilon_{F_0}\|_2 \le a\sigma(1 + \sqrt{20\ln(1/\delta)})\sqrt{q/n}$.

Therefore, with probability exceeding $1 - 2\delta$, both (7) and (8) hold. Therefore, Lemma 10.4 implies that

||A#|| 2 < -|_[4^W ^^PgHx + P%jPGh + + \\£ Fo \\ 2 ] 

+ \\M\* 
2 



+ Wah- 

(2) (2) (2) 

In the first inequality, we have used 7rV < 0- e J I 1 \\, (Proposi- 
tion 3.1). In the second inequality, we have used ||/3g||i < v^HA^Ih < V^ll/fclh) 
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and j „ „ < Pi, \fi (Proposition 3.1). By combining the above estimate 
with Lemma 10.3, we obtain the desired bound. 

10.13. Proof of Corollary 8.1. We take $p = \infty$, $\ell = s$, and $t = 0.5$ in Theorem 8.1. Proposition 3.2 implies that $\pi^{(2)}_{A,s+\ell,\ell} \le M_A(2s)^{1/2}s/(1 - M_A 2s) \le \sqrt{2s}/4$, $\mu^{(\infty)}_{A,s+\ell} \ge 1 - M_A 2s \ge 2/3$ and $\pi^{(\infty)}_{A,s+\ell,\ell} \le M_A s/(1 - M_A 2s) \le 1/4$.

Now, it is clear that t < 1 — 7Tj , „ „k ' 5 II holds. Moreover, the con- 

' — A,k+£,£ ' 

dition WcaJ$ )~ 1 s l / p X < t < 1 - 7r ( f } s 1 ~ 1 /p/£ i s also valid. There- 

v A,s+l' — — A,s+l,l 1 

fore, Theorem 8.1 can be applied with ^ Ak+l > 1 - M A (k + £)> 2/3 and 
pf k+e <l + M A (k + £)<4/3. 

11. Conclusion. This paper considers the performance of least squares
regression with $L_1$ regularization from parameter estimation accuracy and
feature selection quality perspectives. To this end, a general theorem is established in Section 4.

An important consequence of this theorem is a performance bound for 
Lasso similar to that of [4] for the Dantzig selector. The detailed compari- 
son is given in Section 6. Our result gives an affirmative answer to an open 
question in [10] concerning whether a bound similar to that of [4] holds for 
Lasso. Another important consequence of Theorem 4.1 is the feature selec- 
tion quality of Lasso using a nonzero thresholding feature selection method, 
which extends the zero thresholding method considered in [20]. Our method
can remove some limitations of [20], as discussed in Section 7. 

Moreover, we pointed out that the standard (one-stage) Lasso may be sub- 
optimal under certain conditions. However, the problem can be remedied by 
combining the parameter estimation and feature selection perspectives of 
Lasso. In Section 8, a two-stage $L_1$-regularization procedure with selective
penalization was analyzed. In practice, if one is able to appropriately tune
the thresholding parameter using cross-validation, then the procedure should
not be much worse than the standard one-stage Lasso. Theoretically, it is
shown that, if the target vector can be decomposed as the sum of a sparse
parameter vector with large coefficients and another (less sparse) vector with
small coefficients, then the two-stage $L_1$-regularization procedure can lead
to improved performance when $d$ is large.
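The two-stage idea summarized above can be sketched as follows, under stated assumptions: the objective is taken to be $\frac{1}{n}\sum_i(\beta^T x_i - y_i)^2 + \lambda\sum_j w_j|\beta_j|$, the second stage sets $w_j = 0$ for features whose first-stage coefficient exceeds the threshold $\alpha$ in absolute value, and a plain coordinate descent solver is used. The function names, the solver and the synthetic data are illustrative choices, not the paper's implementation.

```python
import numpy as np

def weighted_lasso(X, y, lam, weights, n_iter=200):
    """Coordinate descent for (1/n)||X b - y||^2 + lam * sum_j weights[j] * |b_j|."""
    n, d = X.shape
    beta = np.zeros(d)
    col_sq = np.sum(X ** 2, axis=0) / n            # (1/n) sum_i x_ij^2 for each column j
    resid = np.asarray(y, dtype=float).copy()      # residual y - X beta (beta starts at zero)
    for _ in range(n_iter):
        for j in range(d):
            if col_sq[j] == 0.0:
                continue
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            thr = 0.5 * lam * weights[j]           # soft-threshold level for this coordinate
            new = np.sign(rho) * max(abs(rho) - thr, 0.0) / col_sq[j]
            resid += X[:, j] * (beta[j] - new)     # keep the residual in sync with beta
            beta[j] = new
    return beta

def two_stage_lasso(X, y, lam, alpha):
    """Stage 1: standard Lasso. Stage 2: no penalty on features with |coefficient| > alpha."""
    d = X.shape[1]
    beta1 = weighted_lasso(X, y, lam, np.ones(d))
    weights = np.where(np.abs(beta1) > alpha, 0.0, 1.0)
    beta2 = weighted_lasso(X, y, lam, weights)
    return beta1, beta2

# Synthetic (assumed) example with two large coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
truth = np.zeros(20)
truth[0], truth[1] = 3.0, -2.0
y = X @ truth + 0.1 * rng.normal(size=200)
beta1, beta2 = two_stage_lasso(X, y, lam=0.5, alpha=1.0)
err1 = float(np.max(np.abs(beta1[:2] - truth[:2])))
err2 = float(np.max(np.abs(beta2[:2] - truth[:2])))
print(f"stage-one error {err1:.3f}, stage-two error {err2:.3f}")
```

In this toy setting the first stage shrinks the large coefficients by roughly $\lambda/2$, while the second stage leaves them unpenalized and so removes most of that bias.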

Finally, we shall point out some limitations of our analysis. First, procedures considered in this work are not adaptive. For example, in the one-stage
method, the regularization parameter $\lambda$ has to satisfy certain conditions that
depend on $t$ and the noise level $\sigma$. In feature selection and the two-stage
method, the threshold parameter $\alpha$ also needs to satisfy certain conditions.
Although, in practice, such parameters can be tuned using cross-validation,
it still remains an interesting problem to come up with a theoretical procedure for setting such parameters that leads to a so-called "adaptive" estimation method. Moreover, although bounds in this paper can be applied
in random design situations with small modifications, the results are incomplete for random design because the conditions on the design matrix
(which is now random) need to be shown to concentrate at a certain rate.
Although a number of such results exist in the random matrix literature,
a more general treatment with better integration is still needed in future
work.

Acknowledgments. The author would like to thank Trevor Hastie, Ter-
ferences between $L_1$ regularization and the Dantzig selector.